Artificial Intelligence 184
☆ Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, Pheng-Ann Heng
Recent video generation models can produce high-fidelity, temporally coherent
videos, indicating that they may encode substantial world knowledge. Beyond
realistic synthesis, they also exhibit emerging behaviors indicative of visual
perception, modeling, and manipulation. Yet, an important question still
remains: Are video models ready to serve as zero-shot reasoners in challenging
visual reasoning scenarios? In this work, we conduct an empirical study to
comprehensively investigate this question, focusing on the leading and popular
Veo-3. We evaluate its reasoning behavior across 12 dimensions, including
spatial, geometric, physical, temporal, and embodied logic, systematically
characterizing both its strengths and failure modes. To standardize this study,
we curate the evaluation data into MME-CoF, a compact benchmark that enables
in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our
findings reveal that while current video models demonstrate promising reasoning
patterns on short-horizon spatial coherence, fine-grained grounding, and
locally consistent dynamics, they remain limited in long-horizon causal
reasoning, strict geometric constraints, and abstract logic. Overall, they are
not yet reliable as standalone zero-shot reasoners, but exhibit encouraging
signs as complementary visual engines alongside dedicated reasoning models.
Project page: https://video-cof.github.io
comment: Project Page: https://video-cof.github.io
☆ Gistify! Codebase-Level Understanding via Runtime Execution
Hyunji Lee, Minseon Kim, Chinmay Singh, Matheus Pereira, Atharv Sonwane, Isadora White, Elias Stengel-Eskin, Mohit Bansal, Zhengyan Shi, Alessandro Sordoni, Marc-Alexandre Côté, Xingdi Yuan, Lucas Caccia
As coding agents are increasingly deployed in large codebases, the need to
automatically design challenging, codebase-level evaluation is central. We
propose Gistify, a task where a coding LLM must create a single, minimal,
self-contained file that can reproduce a specific functionality of a codebase.
The coding LLM is given full access to a codebase along with a specific
entrypoint (e.g., a python command), and the generated file must replicate the
output of the same command ran under the full codebase, while containing only
the essential components necessary to execute the provided command. Success on
Gistify requires both structural understanding of the codebase, accurate
modeling of its execution flow as well as the ability to produce potentially
large code patches. Our findings show that current state-of-the-art models
struggle to reliably solve Gistify tasks, especially ones with long executions
traces.
☆ Defeating the Training-Inference Mismatch via FP16
Reinforcement learning (RL) fine-tuning of large language models (LLMs) often
suffers from instability due to the numerical mismatch between the training and
inference policies. While prior work has attempted to mitigate this issue
through algorithmic corrections or engineering alignments, we show that its
root cause lies in the floating point precision itself. The widely adopted
BF16, despite its large dynamic range, introduces large rounding errors that
breaks the consistency between training and inference. In this work, we
demonstrate that simply reverting to \textbf{FP16} effectively eliminates this
mismatch. The change is simple, fully supported by modern frameworks with only
a few lines of code change, and requires no modification to the model
architecture or learning algorithm. Our results suggest that using FP16
uniformly yields more stable optimization, faster convergence, and stronger
performance across diverse tasks, algorithms and frameworks. We hope these
findings motivate a broader reconsideration of precision trade-offs in RL
fine-tuning.
☆ Remote Labor Index: Measuring AI Automation of Remote Work
Mantas Mazeika, Alice Gatti, Cristina Menghini, Udari Madhushani Sehwag, Shivam Singhal, Yury Orlovskiy, Steven Basart, Manasi Sharma, Denis Peskoff, Elaine Lau, Jaehyuk Lim, Lachlan Carroll, Alice Blair, Vinaya Sivakumar, Sumana Basu, Brad Kenstler, Yuntao Ma, Julian Michael, Xiaoke Li, Oliver Ingebretsen, Aditya Mehta, Jean Mottola, John Teichmann, Kevin Yu, Zaina Shaik, Adam Khoja, Richard Ren, Jason Hausenloy, Long Phan, Ye Htet, Ankit Aich, Tahseen Rabbani, Vivswan Shah, Andriy Novykov, Felix Binder, Kirill Chugunov, Luis Ramirez, Matias Geralnik, Hernán Mesura, Dean Lee, Ed-Yeremai Hernandez Cardona, Annette Diamond, Summer Yue, Alexandr Wang, Bing Liu, Ernesto Hernandez, Dan Hendrycks
AIs have made rapid progress on research-oriented benchmarks of knowledge and
reasoning, but it remains unclear how these gains translate into economic value
and automation. To measure this, we introduce the Remote Labor Index (RLI), a
broadly multi-sector benchmark comprising real-world, economically valuable
projects designed to evaluate end-to-end agent performance in practical
settings. AI agents perform near the floor on RLI, with the highest-performing
agent achieving an automation rate of 2.5%. These results help ground
discussions of AI automation in empirical evidence, setting a common basis for
tracking AI impacts and enabling stakeholders to proactively navigate AI-driven
labor automation.
comment: Website: https://www.remotelabor.ai
☆ LLMs Process Lists With General Filter Heads
We investigate the mechanisms underlying a range of list-processing tasks in
LLMs, and we find that LLMs have learned to encode a compact, causal
representation of a general filtering operation that mirrors the generic
"filter" function of functional programming. Using causal mediation analysis on
a diverse set of list-processing tasks, we find that a small number of
attention heads, which we dub filter heads, encode a compact representation of
the filtering predicate in their query states at certain tokens. We demonstrate
that this predicate representation is general and portable: it can be extracted
and reapplied to execute the same filtering operation on different collections,
presented in different formats, languages, or even in tasks. However, we also
identify situations where transformer LMs can exploit a different strategy for
filtering: eagerly evaluating if an item satisfies the predicate and storing
this intermediate result as a flag directly in the item representations. Our
results reveal that transformer LMs can develop human-interpretable
implementations of abstract computational operations that generalize in ways
that are surprisingly similar to strategies used in traditional functional
programming patterns.
comment: Code and data at https://filter.baulab.info/
☆ Clone Deterministic 3D Worlds with Geometrically-Regularized World Models
A world model is an internal model that simulates how the world evolves.
Given past observations and actions, it predicts the future of both the
embodied agent and its environment. Accurate world models are essential for
enabling agents to think, plan, and reason effectively in complex, dynamic
settings. Despite rapid progress, current world models remain brittle and
degrade over long horizons. We argue that a central cause is representation
quality: exteroceptive inputs (e.g., images) are high-dimensional, and lossy or
entangled latents make dynamics learning unnecessarily hard. We therefore ask
whether improving representation learning alone can substantially improve
world-model performance. In this work, we take a step toward building a truly
accurate world model by addressing a fundamental yet open problem: constructing
a model that can fully clone and overfit to a deterministic 3D world. We
propose Geometrically-Regularized World Models (GRWM), which enforces that
consecutive points along a natural sensory trajectory remain close in latent
representation space. This approach yields significantly improved latent
representations that align closely with the true topology of the environment.
GRWM is plug-and-play, requires only minimal architectural modification, scales
with trajectory length, and is compatible with diverse latent generative
backbones. Across deterministic 3D settings and long-horizon prediction tasks,
GRWM significantly increases rollout fidelity and stability. Analyses show that
its benefits stem from learning a latent manifold with superior geometric
structure. These findings support a clear takeaway: improving representation
learning is a direct and useful path to robust world models, delivering
reliable long-horizon predictions without enlarging the dynamics module.
☆ Faithful and Fast Influence Function via Advanced Sampling
How can we explain the influence of training data on black-box models?
Influence functions (IFs) offer a post-hoc solution by utilizing gradients and
Hessians. However, computing the Hessian for an entire dataset is
resource-intensive, necessitating a feasible alternative. A common approach
involves randomly sampling a small subset of the training data, but this method
often results in highly inconsistent IF estimates due to the high variance in
sample configurations. To address this, we propose two advanced sampling
techniques based on features and logits. These samplers select a small yet
representative subset of the entire dataset by considering the stochastic
distribution of features or logits, thereby enhancing the accuracy of IF
estimations. We validate our approach through class removal experiments, a
typical application of IFs, using the F1-score to measure how effectively the
model forgets the removed class while maintaining inference consistency on the
remaining classes. Our method reduces computation time by 30.1% and memory
usage by 42.2%, or improves the F1-score by 2.5% compared to the baseline.
☆ STaMP: Sequence Transformation and Mixed Precision for Low-Precision Activation Quantization
Quantization is the key method for reducing inference latency, power and
memory footprint of generative AI models. However, accuracy often degrades
sharply when activations are quantized below eight bits. Recent work suggests
that invertible linear transformations (e.g. rotations) can aid quantization,
by reparameterizing feature channels and weights. In this paper, we propose
\textit{Sequence Transformation and Mixed Precision} (STaMP) quantization, a
novel strategy that applies linear transformations along the \textit{sequence}
dimension to exploit the strong local correlation in language and visual data.
By keeping a small number of tokens in each intermediate activation at higher
precision, we can maintain model accuracy at lower (average) activations
bit-widths. We evaluate STaMP on recent LVM and LLM architectures,
demonstrating that it significantly improves low bit width activation
quantization and complements established activation and weight quantization
methods including recent feature transformations.
comment: 10 pages main text, 8 pages supplementary material
☆ AMO-Bench: Large Language Models Still Struggle in High School Math Competitions
Shengnan An, Xunliang Cai, Xuezhi Cao, Xiaoyu Li, Yehao Lin, Junlin Liu, Xinxuan Lv, Dan Ma, Xuanlin Wang, Ziwen Wang, Shuang Zhou
We present AMO-Bench, an Advanced Mathematical reasoning benchmark with
Olympiad level or even higher difficulty, comprising 50 human-crafted problems.
Existing benchmarks have widely leveraged high school math competitions for
evaluating mathematical reasoning capabilities of large language models (LLMs).
However, many existing math competitions are becoming less effective for
assessing top-tier LLMs due to performance saturation (e.g., AIME24/25). To
address this, AMO-Bench introduces more rigorous challenges by ensuring all 50
problems are (1) cross-validated by experts to meet at least the International
Mathematical Olympiad (IMO) difficulty standards, and (2) entirely original
problems to prevent potential performance leakages from data memorization.
Moreover, each problem in AMO-Bench requires only a final answer rather than a
proof, enabling automatic and robust grading for evaluation. Experimental
results across 26 LLMs on AMO-Bench show that even the best-performing model
achieves only 52.4% accuracy on AMO-Bench, with most LLMs scoring below 40%.
Beyond these poor performances, our further analysis reveals a promising
scaling trend with increasing test-time compute on AMO-Bench. These results
highlight the significant room for improving the mathematical reasoning in
current LLMs. We release AMO-Bench to facilitate further research into
advancing the reasoning abilities of language models.
https://amo-bench.github.io/
comment: 14 pages, 9 figures
☆ The Oversight Game: Learning to Cooperatively Balance an AI Agent's Safety and Autonomy
As increasingly capable agents are deployed, a central safety question is how
to retain meaningful human control without modifying the underlying system. We
study a minimal control interface where an agent chooses whether to act
autonomously (play) or defer (ask), while a human simultaneously chooses
whether to be permissive (trust) or to engage in oversight (oversee). If the
agent defers, the human's choice determines the outcome, potentially leading to
a corrective action or a system shutdown. We model this interaction as a
two-player Markov Game. Our analysis focuses on cases where this game qualifies
as a Markov Potential Game (MPG), a class of games where we can provide an
alignment guarantee: under a structural assumption on the human's value
function, any decision by the agent to act more autonomously that benefits
itself cannot harm the human's value. We also analyze extensions to this MPG
framework. Theoretically, this perspective provides conditions for a specific
form of intrinsic alignment. If the reward structures of the human-agent game
meet these conditions, we have a formal guarantee that the agent improving its
own outcome will not harm the human's. Practically, this model motivates a
transparent control layer with predictable incentives where the agent learns to
defer when risky and act when safe, while its pretrained policy and the
environment's reward structure remain untouched. Our gridworld simulation shows
that through independent learning, the agent and human discover their optimal
oversight roles. The agent learns to ask when uncertain and the human learns
when to oversee, leading to an emergent collaboration that avoids safety
violations introduced post-training. This demonstrates a practical method for
making misaligned models safer after deployment.
☆ Deep sequence models tend to memorize geometrically; it is unclear why
In sequence modeling, the parametric memory of atomic facts has been
predominantly abstracted as a brute-force lookup of co-occurrences between
entities. We contrast this associative view against a geometric view of how
memory is stored. We begin by isolating a clean and analyzable instance of
Transformer reasoning that is incompatible with memory as strictly a storage of
the local co-occurrences specified during training. Instead, the model must
have somehow synthesized its own geometry of atomic facts, encoding global
relationships between all entities, including non-co-occurring ones. This in
turn has simplified a hard reasoning task involving an $\ell$-fold composition
into an easy-to-learn 1-step geometric task.
From this phenomenon, we extract fundamental aspects of neural embedding
geometries that are hard to explain. We argue that the rise of such a geometry,
despite optimizing over mere local associations, cannot be straightforwardly
attributed to typical architectural or optimizational pressures.
Counterintuitively, an elegant geometry is learned even when it is not more
succinct than a brute-force lookup of associations.
Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry
stems from a spectral bias that -- in contrast to prevailing theories -- indeed
arises naturally despite the lack of various pressures. This analysis also
points to practitioners a visible headroom to make Transformer memory more
strongly geometric. We hope the geometric view of parametric memory encourages
revisiting the default intuitions that guide researchers in areas like
knowledge acquisition, capacity, discovery and unlearning.
☆ A General Incentives-Based Framework for Fairness in Multi-agent Resource Allocation
We introduce the General Incentives-based Framework for Fairness (GIFF), a
novel approach for fair multi-agent resource allocation that infers fair
decision-making from standard value functions. In resource-constrained
settings, agents optimizing for efficiency often create inequitable outcomes.
Our approach leverages the action-value (Q-)function to balance efficiency and
fairness without requiring additional training. Specifically, our method
computes a local fairness gain for each action and introduces a counterfactual
advantage correction term to discourage over-allocation to already well-off
agents. This approach is formalized within a centralized control setting, where
an arbitrator uses the GIFF-modified Q-values to solve an allocation problem.
Empirical evaluations across diverse domains, including dynamic ridesharing,
homelessness prevention, and a complex job allocation task-demonstrate that our
framework consistently outperforms strong baselines and can discover
far-sighted, equitable policies. The framework's effectiveness is supported by
a theoretical foundation; we prove its fairness surrogate is a principled lower
bound on the true fairness improvement and that its trade-off parameter offers
monotonic tuning. Our findings establish GIFF as a robust and principled
framework for leveraging standard reinforcement learning components to achieve
more equitable outcomes in complex multi-agent systems.
☆ Cross-Platform Evaluation of Reasoning Capabilities in Foundation Models
This paper presents a comprehensive cross-platform evaluation of reasoning
capabilities in contemporary foundation models, establishing an
infrastructure-agnostic benchmark across three computational paradigms: HPC
supercomputing (MareNostrum 5), cloud platforms (Nebius AI Studio), and
university clusters (a node with eight H200 GPUs).
We evaluate 15 foundation models across 79 problems spanning eight academic
domains (Physics, Mathematics, Chemistry, Economics, Biology, Statistics,
Calculus, and Optimization) through three experimental phases: (1) Baseline
establishment: Six models (Mixtral-8x7B, Phi-3, LLaMA 3.1-8B, Gemma-2-9b,
Mistral-7B, OLMo-7B) evaluated on 19 problems using MareNostrum 5, establishing
methodology and reference performance; (2) Infrastructure validation: The
19-problem benchmark repeated on university cluster (seven models including
Falcon-Mamba state-space architecture) and Nebius AI Studio (nine
state-of-the-art models: Hermes-4 70B/405B, LLaMA 3.1-405B/3.3-70B, Qwen3
30B/235B, DeepSeek-R1, GPT-OSS 20B/120B) to confirm infrastructure-agnostic
reproducibility; (3) Extended evaluation: Full 79-problem assessment on both
university cluster and Nebius platforms, probing generalization at scale across
architectural diversity.
The findings challenge conventional scaling assumptions, establish training
data quality as more critical than model size, and provide actionable
guidelines for model selection across educational, production, and research
contexts. The tri-infrastructure methodology and 79-problem benchmark enable
longitudinal tracking of reasoning capabilities as foundation models evolve.
☆ ExpertFlow: Adaptive Expert Scheduling and Memory Coordination for Efficient MoE Inference
The expansion of large language models is increasingly limited by the
constrained memory capacity of modern GPUs. To mitigate this,
Mixture-of-Experts (MoE) architectures activate only a small portion of
parameters during inference, significantly lowering both memory demand and
computational overhead. However, conventional MoE inference approaches, which
select active experts independently at each layer, often introduce considerable
latency because of frequent parameter transfers between host and GPU memory. In
addition, current cross-layer prediction strategies, which are typically based
on fixed steps, lack adaptability across different hardware platforms and
workloads, thereby reducing their robustness and effectiveness.
To address these challenges, we present ExpertFlow, a runtime system for MoE
inference that combines adaptive expert prefetching and cache-aware routing.
ExpertFlow continuously adjusts its prediction horizon for expert activation by
leveraging runtime statistics such as transfer bandwidth, parameter
dimensionality, and model feedback signals. Furthermore, it incorporates a
hybrid cross-layer prediction scheme that fuses pregating information with
intermediate computational states to anticipate future expert needs. By
adaptively refining prefetching decisions and aligning them with actual usage
behavior, ExpertFlow effectively decreases cache misses and removes latency
caused by expert swap-ins. Our evaluation demonstrates that ExpertFlow reduces
model stall time to less than 0.1% of the baseline, highlighting its capability
to optimize MoE inference under stringent memory constraints.
comment: 12 pages, 11 figures
☆ Non-Convex Over-the-Air Heterogeneous Federated Learning: A Bias-Variance Trade-off
Over-the-air (OTA) federated learning (FL) has been well recognized as a
scalable paradigm that exploits the waveform superposition of the wireless
multiple-access channel to aggregate model updates in a single use. Existing
OTA-FL designs largely enforce zero-bias model updates by either assuming
\emph{homogeneous} wireless conditions (equal path loss across devices) or
forcing zero-bias updates to guarantee convergence. Under \emph{heterogeneous}
wireless scenarios, however, such designs are constrained by the weakest device
and inflate the update variance. Moreover, prior analyses of biased OTA-FL
largely address convex objectives, while most modern AI models are highly
non-convex. Motivated by these gaps, we study OTA-FL with stochastic gradient
descent (SGD) for general smooth non-convex objectives under wireless
heterogeneity. We develop novel OTA-FL SGD updates that allow a structured,
time-invariant model bias while facilitating reduced variance updates. We
derive a finite-time stationarity bound (expected time average squared gradient
norm) that explicitly reveals a bias-variance trade-off. To optimize this
trade-off, we pose a non-convex joint OTA power-control design and develop an
efficient successive convex approximation (SCA) algorithm that requires only
statistical CSI at the base station. Experiments on a non-convex image
classification task validate the approach: the SCA-based design accelerates
convergence via an optimized bias and improves generalization over prior OTA-FL
baselines.
☆ Unveiling Intrinsic Text Bias in Multimodal Large Language Models through Attention Key-Space Analysis
Multimodal large language models (MLLMs) exhibit a pronounced preference for
textual inputs when processing vision-language data, limiting their ability to
reason effectively from visual evidence. Unlike prior studies that attribute
this text bias to external factors such as data imbalance or instruction
tuning, we propose that the bias originates from the model's internal
architecture. Specifically, we hypothesize that visual key vectors (Visual
Keys) are out-of-distribution (OOD) relative to the text key space learned
during language-only pretraining. Consequently, these visual keys receive
systematically lower similarity scores during attention computation, leading to
their under-utilization in the context representation. To validate this
hypothesis, we extract key vectors from LLaVA and Qwen2.5-VL and analyze their
distributional structures using qualitative (t-SNE) and quantitative
(Jensen-Shannon divergence) methods. The results provide direct evidence that
visual and textual keys occupy markedly distinct subspaces within the attention
space. The inter-modal divergence is statistically significant, exceeding
intra-modal variation by several orders of magnitude. These findings reveal
that text bias arises from an intrinsic misalignment within the attention key
space rather than solely from external data factors.
☆ On the limitation of evaluating machine unlearning using only a single training seed
Machine unlearning (MU) aims to remove the influence of certain data points
from a trained model without costly retraining. Most practical MU algorithms
are only approximate and their performance can only be assessed empirically.
Care must therefore be taken to make empirical comparisons as representative as
possible. A common practice is to run the MU algorithm multiple times
independently starting from the same trained model. In this work, we
demonstrate that this practice can give highly non-representative results
because -- even for the same architecture and same dataset -- some MU methods
can be highly sensitive to the choice of random number seed used for model
training. We therefore recommend that empirical
comphttps://info.arxiv.org/help/prep#commentsarisons of MU algorithms should
also reflect the variability across different model training seeds.
comment: mini paper, 2 figures
☆ Delegated Authorization for Agents Constrained to Semantic Task-to-Scope Matching
Authorizing Large Language Model driven agents to dynamically invoke tools
and access protected resources introduces significant risks, since current
methods for delegating authorization grant overly broad permissions and give
access to tools allowing agents to operate beyond the intended task scope. We
introduce and assess a delegated authorization model enabling authorization
servers to semantically inspect access requests to protected resources, and
issue access tokens constrained to the minimal set of scopes necessary for the
agents' assigned tasks. Given the unavailability of datasets centered on
delegated authorization flows, particularly including both semantically
appropriate and inappropriate scope requests for a given task, we introduce
ASTRA, a dataset and data generation pipeline for benchmarking semantic
matching between tasks and scopes. Our experiments show both the potential and
current limitations of model-based matching, particularly as the number of
scopes needed for task completion increases. Our results highlight the need for
further research into semantic matching techniques enabling intent-aware
authorization for multi-agent and tool-augmented applications, including
fine-grained control, such as Task-Based Access Control (TBAC).
comment: Paper page at https://outshift-open.github.io/ASTRA
☆ The End of Manual Decoding: Towards Truly End-to-End Language Models
Zhichao Wang, Dongyang Ma, Xinting Huang, Deng Cai, Tian Lan, Jiahao Xu, Haitao Mi, Xiaoying Tang, Yan Wang
The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a
non-differentiable decoding process that requires laborious, hand-tuning of
hyperparameters like temperature and top-p. This paper introduces AutoDeco, a
novel architecture that enables truly "end-to-end" generation by learning to
control its own decoding strategy. We augment the standard transformer with
lightweight heads that, at each step, dynamically predict context-specific
temperature and top-p values alongside the next-token logits. This approach
transforms decoding into a parametric, token-level process, allowing the model
to self-regulate its sampling strategy within a single forward pass.
Through extensive experiments on eight benchmarks, we demonstrate that
AutoDeco not only significantly outperforms default decoding strategies but
also achieves performance comparable to an oracle-tuned baseline derived from
"hacking the test set"-a practical upper bound for any static method.
Crucially, we uncover an emergent capability for instruction-based decoding
control: the model learns to interpret natural language commands (e.g.,
"generate with low randomness") and adjusts its predicted temperature and top-p
on a token-by-token basis, opening a new paradigm for steerable and interactive
LLM decoding.
☆ Process Integrated Computer Vision for Real-Time Failure Prediction in Steel Rolling Mill
We present a long-term deployment study of a machine vision-based anomaly
detection system for failure prediction in a steel rolling mill. The system
integrates industrial cameras to monitor equipment operation, alignment, and
hot bar motion in real time along the process line. Live video streams are
processed on a centralized video server using deep learning models, enabling
early prediction of equipment failures and process interruptions, thereby
reducing unplanned breakdown costs. Server-based inference minimizes the
computational load on industrial process control systems (PLCs), supporting
scalable deployment across production lines with minimal additional resources.
By jointly analyzing sensor data from data acquisition systems and visual
inputs, the system identifies the location and probable root causes of
failures, providing actionable insights for proactive maintenance. This
integrated approach enhances operational reliability, productivity, and
profitability in industrial manufacturing environments.
☆ Evontree: Ontology Rule-Guided Self-Evolution of Large Language Models
Large language models (LLMs) have demonstrated exceptional capabilities
across multiple domains by leveraging massive pre-training and curated
fine-tuning data. However, in data-sensitive fields such as healthcare, the
lack of high-quality, domain-specific training corpus hinders LLMs' adaptation
for specialized applications. Meanwhile, domain experts have distilled domain
wisdom into ontology rules, which formalize relationships among concepts and
ensure the integrity of knowledge management repositories. Viewing LLMs as
implicit repositories of human knowledge, we propose Evontree, a novel
framework that leverages a small set of high-quality ontology rules to
systematically extract, validate, and enhance domain knowledge within LLMs,
without requiring extensive external datasets. Specifically, Evontree extracts
domain ontology from raw models, detects inconsistencies using two core
ontology rules, and reinforces the refined knowledge via self-distilled
fine-tuning. Extensive experiments on medical QA benchmarks with
Llama3-8B-Instruct and Med42-v2 demonstrate consistent outperformance over both
unmodified models and leading supervised baselines, achieving up to a 3.7%
improvement in accuracy. These results confirm the effectiveness, efficiency,
and robustness of our approach for low-resource domain adaptation of LLMs.
☆ The Era of Agentic Organization: Learning to Organize with Language Models
We envision a new era of AI, termed agentic organization, where agents solve
complex problems by working collaboratively and concurrently, enabling outcomes
beyond individual intelligence. To realize this vision, we introduce
asynchronous thinking (AsyncThink) as a new paradigm of reasoning with large
language models, which organizes the internal thinking process into
concurrently executable structures. Specifically, we propose a thinking
protocol where an organizer dynamically assigns sub-queries to workers, merges
intermediate knowledge, and produces coherent solutions. More importantly, the
thinking structure in this protocol can be further optimized through
reinforcement learning. Experiments demonstrate that AsyncThink achieves 28%
lower inference latency compared to parallel thinking while improving accuracy
on mathematical reasoning. Moreover, AsyncThink generalizes its learned
asynchronous thinking capabilities, effectively tackling unseen tasks without
additional training.
☆ Hybrid DQN-TD3 Reinforcement Learning for Autonomous Navigation in Dynamic Environments
This paper presents a hierarchical path-planning and control framework that
combines a high-level Deep Q-Network (DQN) for discrete sub-goal selection with
a low-level Twin Delayed Deep Deterministic Policy Gradient (TD3) controller
for continuous actuation. The high-level module selects behaviors and
sub-goals; the low-level module executes smooth velocity commands. We design a
practical reward shaping scheme (direction, distance, obstacle avoidance,
action smoothness, collision penalty, time penalty, and progress), together
with a LiDAR-based safety gate that prevents unsafe motions. The system is
implemented in ROS + Gazebo (TurtleBot3) and evaluated with PathBench metrics,
including success rate, collision rate, path efficiency, and re-planning
efficiency, in dynamic and partially observable environments. Experiments show
improved success rate and sample efficiency over single-algorithm baselines
(DQN or TD3 alone) and rule-based planners, with better generalization to
unseen obstacle configurations and reduced abrupt control changes. Code and
evaluation scripts are available at the project repository.
comment: 6 pages, 5 figures; ROS+Gazebo (TurtleBot3) implementation;
evaluation with PathBench metrics; code (primary):
https://github.com/MayaCHEN-github/HierarchicalRL-robot-navigation; mirror
(for reproducibility): https://github.com/ShowyHe/DRL-robot-navigation
☆ Aeolus: A Multi-structural Flight Delay Dataset
We introduce Aeolus, a large-scale Multi-modal Flight Delay Dataset designed
to advance research on flight delay prediction and support the development of
foundation models for tabular data. Existing datasets in this domain are
typically limited to flat tabular structures and fail to capture the
spatiotemporal dynamics inherent in delay propagation. Aeolus addresses this
limitation by providing three aligned modalities: (i) a tabular dataset with
rich operational, meteorological, and airportlevel features for over 50 million
flights; (ii) a flight chain module that models delay propagation along
sequential flight legs, capturing upstream and downstream dependencies; and
(iii) a flight network graph that encodes shared aircraft, crew, and airport
resource connections, enabling cross-flight relational reasoning. The dataset
is carefully constructed with temporal splits, comprehensive features, and
strict leakage prevention to support realistic and reproducible machine
learning evaluation. Aeolus supports a broad range of tasks, including
regression, classification, temporal structure modeling, and graph learning,
serving as a unified benchmark across tabular, sequential, and graph
modalities. We release baseline experiments and preprocessing tools to
facilitate adoption. Aeolus fills a key gap for both domain-specific modeling
and general-purpose structured data research.Our source code and data can be
accessed at https://github.com/Flnny/Delay-data
☆ Normative Reasoning in Large Language Models: A Comparative Benchmark from Logical and Modal Perspectives EMNLP 2025
Normative reasoning is a type of reasoning that involves normative or deontic
modality, such as obligation and permission. While large language models (LLMs)
have demonstrated remarkable performance across various reasoning tasks, their
ability to handle normative reasoning remains underexplored. In this paper, we
systematically evaluate LLMs' reasoning capabilities in the normative domain
from both logical and modal perspectives. Specifically, to assess how well LLMs
reason with normative modals, we make a comparison between their reasoning with
normative modals and their reasoning with epistemic modals, which share a
common formal structure. To this end, we introduce a new dataset covering a
wide range of formal patterns of reasoning in both normative and epistemic
domains, while also incorporating non-formal cognitive factors that influence
human reasoning. Our results indicate that, although LLMs generally adhere to
valid reasoning patterns, they exhibit notable inconsistencies in specific
types of normative reasoning and display cognitive biases similar to those
observed in psychological studies of human reasoning. These findings highlight
challenges in achieving logical consistency in LLMs' normative reasoning and
provide insights for enhancing their reliability. All data and code are
released publicly at https://github.com/kmineshima/NeuBAROCO.
comment: Accepted to the 8th BlackboxNLP Workshop at EMNLP 2025
☆ Agentic AI Home Energy Management System: A Large Language Model Framework for Residential Load Scheduling
The electricity sector transition requires substantial increases in
residential demand response capacity, yet Home Energy Management Systems (HEMS)
adoption remains limited by user interaction barriers requiring translation of
everyday preferences into technical parameters. While large language models
have been applied to energy systems as code generators and parameter
extractors, no existing implementation deploys LLMs as autonomous coordinators
managing the complete workflow from natural language input to multi-appliance
scheduling. This paper presents an agentic AI HEMS where LLMs autonomously
coordinate multi-appliance scheduling from natural language requests to device
control, achieving optimal scheduling without example demonstrations. A
hierarchical architecture combining one orchestrator with three specialist
agents uses the ReAct pattern for iterative reasoning, enabling dynamic
coordination without hardcoded workflows while integrating Google Calendar for
context-aware deadline extraction. Evaluation across three open-source models
using real Austrian day-ahead electricity prices reveals substantial capability
differences. Llama-3.3-70B successfully coordinates all appliances across all
scenarios to match cost-optimal benchmarks computed via mixed-integer linear
programming, while other models achieve perfect single-appliance performance
but struggle to coordinate all appliances simultaneously. Progressive prompt
engineering experiments demonstrate that analytical query handling without
explicit guidance remains unreliable despite models' general reasoning
capabilities. We open-source the complete system including orchestration logic,
agent prompts, tools, and web interfaces to enable reproducibility, extension,
and future research.
comment: 34 pages, 9 figures. Code available at
https://github.com/RedaElMakroum/agentic-ai-hems
☆ ResMatching: Noise-Resilient Computational Super-Resolution via Guided Conditional Flow Matching
Computational Super-Resolution (CSR) in fluorescence microscopy has, despite
being an ill-posed problem, a long history. At its very core, CSR is about
finding a prior that can be used to extrapolate frequencies in a micrograph
that have never been imaged by the image-generating microscope. It stands to
reason that, with the advent of better data-driven machine learning techniques,
stronger prior can be learned and hence CSR can lead to better results. Here,
we present ResMatching, a novel CSR method that uses guided conditional flow
matching to learn such improved data-priors. We evaluate ResMatching on 4
diverse biological structures from the BioSR dataset and compare its results
against 7 baselines. ResMatching consistently achieves competitive results,
demonstrating in all cases the best trade-off between data fidelity and
perceptual realism. We observe that CSR using ResMatching is particularly
effective in cases where a strong prior is hard to learn, e.g. when the given
low-resolution images contain a lot of noise. Additionally, we show that
ResMatching can be used to sample from an implicitly learned posterior
distribution and that this distribution is calibrated for all tested use-cases,
enabling our method to deliver a pixel-wise data-uncertainty term that can
guide future users to reject uncertain predictions.
comment: 5 pages, 4 figures
☆ Stop Wasting Your Tokens: Towards Efficient Runtime Multi-Agent Systems
While Multi-Agent Systems (MAS) excel at complex tasks, their growing
autonomy with operational complexity often leads to critical inefficiencies,
such as excessive token consumption and failures arising from misinformation.
Existing methods primarily focus on post-hoc failure attribution, lacking
proactive, real-time interventions to enhance robustness and efficiency. To
this end, we introduce SupervisorAgent, a lightweight and modular framework for
runtime, adaptive supervision that operates without altering the base agent's
architecture. Triggered by an LLM-free adaptive filter, SupervisorAgent
intervenes at critical junctures to proactively correct errors, guide
inefficient behaviors, and purify observations. On the challenging GAIA
benchmark, SupervisorAgent reduces the token consumption of the Smolagent
framework by an average of 29.45% without compromising its success rate.
Extensive experiments across five additional benchmarks (math reasoning, code
generation, and question answering) and various SoTA foundation models validate
the broad applicability and robustness of our approach. The code is available
at https://github.com/LINs-lab/SupervisorAgent.
☆ InfoFlow: Reinforcing Search Agent Via Reward Density Optimization
Reinforcement Learning with Verifiable Rewards (RLVR) is a promising approach
for enhancing agentic deep search. However, its application is often hindered
by low \textbf{Reward Density} in deep search scenarios, where agents expend
significant exploratory costs for infrequent and often null final rewards. In
this paper, we formalize this challenge as the \textbf{Reward Density
Optimization} problem, which aims to improve the reward obtained per unit of
exploration cost. This paper introduce \textbf{InfoFlow}, a systematic
framework that tackles this problem from three aspects. 1) \textbf{Subproblem
decomposition}: breaking down long-range tasks to assign process rewards,
thereby providing denser learning signals. 2) \textbf{Failure-guided hints}:
injecting corrective guidance into stalled trajectories to increase the
probability of successful outcomes. 3) \textbf{Dual-agent refinement}:
employing a dual-agent architecture to offload the cognitive burden of deep
exploration. A refiner agent synthesizes the search history, which effectively
compresses the researcher's perceived trajectory, thereby reducing exploration
cost and increasing the overall reward density. We evaluate InfoFlow on
multiple agentic search benchmarks, where it significantly outperforms strong
baselines, enabling lightweight LLMs to achieve performance comparable to
advanced proprietary LLMs.
☆ Multiclass Local Calibration With the Jensen-Shannon Distance
Developing trustworthy Machine Learning (ML) models requires their predicted
probabilities to be well-calibrated, meaning they should reflect true-class
frequencies. Among calibration notions in multiclass classification, strong
calibration is the most stringent, as it requires all predicted probabilities
to be simultaneously calibrated across all classes. However, existing
approaches to multiclass calibration lack a notion of distance among inputs,
which makes them vulnerable to proximity bias: predictions in sparse regions of
the feature space are systematically miscalibrated. This is especially relevant
in high-stakes settings, such as healthcare, where the sparse instances are
exactly those most at risk of biased treatment. In this work, we address this
main shortcoming by introducing a local perspective on multiclass calibration.
First, we formally define multiclass local calibration and establish its
relationship with strong calibration. Second, we theoretically analyze the
pitfalls of existing evaluation metrics when applied to multiclass local
calibration. Third, we propose a practical method for enhancing local
calibration in Neural Networks, which enforces alignment between predicted
probabilities and local estimates of class frequencies using the Jensen-Shannon
distance. Finally, we empirically validate our approach against existing
multiclass calibration techniques.
☆ Adaptive Inverse Kinematics Framework for Learning Variable-Length Tool Manipulation in Robotics
Conventional robots possess a limited understanding of their kinematics and
are confined to preprogrammed tasks, hindering their ability to leverage tools
efficiently. Driven by the essential components of tool usage - grasping the
desired outcome, selecting the most suitable tool, determining optimal tool
orientation, and executing precise manipulations - we introduce a pioneering
framework. Our novel approach expands the capabilities of the robot's inverse
kinematics solver, empowering it to acquire a sequential repertoire of actions
using tools of varying lengths. By integrating a simulation-learned action
trajectory with the tool, we showcase the practicality of transferring acquired
skills from simulation to real-world scenarios through comprehensive
experimentation. Remarkably, our extended inverse kinematics solver
demonstrates an impressive error rate of less than 1 cm. Furthermore, our
trained policy achieves a mean error of 8 cm in simulation. Noteworthy, our
model achieves virtually indistinguishable performance when employing two
distinct tools of different lengths. This research provides an indication of
potential advances in the exploration of all four fundamental aspects of tool
usage, enabling robots to master the intricate art of tool manipulation across
diverse tasks.
comment: 10 pages, 5 figures. Demonstrates a reinforcement learning framework
for adaptive tool manipulation with variable-length extensions
☆ EdgeRunner 20B: Military Task Parity with GPT-5 while Running on the Edge
Jack FitzGerald, Aristotelis Lazaridis, Dylan Bates, Aman Sharma, Jonnathan Castillo, Yousif Azami, Sean Bailey, Jeremy Cao, Peter Damianov, Kevin de Haan, Luke Kerbs, Vincent Lu, Joseph Madigan, Jeremy McLaurin, Jonathan Tainer, Dave Anderson, Jonathan Beck, Jamie Cuticello, Colton Malkerson, Tyler Saltsman
We present EdgeRunner 20B, a fine-tuned version of gpt-oss-20b optimized for
military tasks. EdgeRunner 20B was trained on 1.6M high-quality records curated
from military documentation and websites. We also present four new tests sets:
(a) combat arms, (b) combat medic, (c) cyber operations, and (d) mil-bench-5k
(general military knowledge). On these military test sets, EdgeRunner 20B
matches or exceeds GPT-5 task performance with 95%+ statistical significance,
except for the high reasoning setting on the combat medic test set and the low
reasoning setting on the mil-bench-5k test set. Versus gpt-oss-20b, there is no
statistically-significant regression on general-purpose benchmarks like ARC-C,
GPQA Diamond, GSM8k, IFEval, MMLU Pro, or TruthfulQA, except for GSM8k in the
low reasoning setting. We also present analyses on hyperparameter settings,
cost, and throughput. These findings show that small, locally-hosted models are
ideal solutions for data-sensitive operations such as in the military domain,
allowing for deployment in air-gapped edge devices.
comment: 19 pages
☆ The Structure of Relation Decoding Linear Operators in Large Language Models NeurIPS 2025
This paper investigates the structure of linear operators introduced in
Hernandez et al. [2023] that decode specific relational facts in transformer
language models. We extend their single-relation findings to a collection of
relations and systematically chart their organization. We show that such
collections of relation decoders can be highly compressed by simple order-3
tensor networks without significant loss in decoding accuracy. To explain this
surprising redundancy, we develop a cross-evaluation protocol, in which we
apply each linear decoder operator to the subjects of every other relation. Our
results reveal that these linear maps do not encode distinct relations, but
extract recurring, coarse-grained semantic properties (e.g., country of capital
city and country of food are both in the country-of-X property). This
property-centric structure clarifies both the operators' compressibility and
highlights why they generalize only to new relations that are semantically
close. Our findings thus interpret linear relational decoding in transformer
language models as primarily property-based, rather than relation-specific.
comment: NeurIPS 2025 (Spotlight)
☆ Human-AI Complementarity: A Goal for Amplified Oversight
Human feedback is critical for aligning AI systems to human values. As AI
capabilities improve and AI is used to tackle more challenging tasks, verifying
quality and safety becomes increasingly challenging. This paper explores how we
can leverage AI to improve the quality of human oversight. We focus on an
important safety problem that is already challenging for humans:
fact-verification of AI outputs. We find that combining AI ratings and human
ratings based on AI rater confidence is better than relying on either alone.
Giving humans an AI fact-verification assistant further improves their
accuracy, but the type of assistance matters. Displaying AI explanation,
confidence, and labels leads to over-reliance, but just showing search results
and evidence fosters more appropriate trust. These results have implications
for Amplified Oversight -- the challenge of combining humans and AI to
supervise AI systems even as they surpass human expert performance.
☆ Inside CORE-KG: Evaluating Structured Prompting and Coreference Resolution for Knowledge Graphs ICDM 2025
Human smuggling networks are increasingly adaptive and difficult to analyze.
Legal case documents offer critical insights but are often unstructured,
lexically dense, and filled with ambiguous or shifting references, which pose
significant challenges for automated knowledge graph (KG) construction. While
recent LLM-based approaches improve over static templates, they still generate
noisy, fragmented graphs with duplicate nodes due to the absence of guided
extraction and coreference resolution. The recently proposed CORE-KG framework
addresses these limitations by integrating a type-aware coreference module and
domain-guided structured prompts, significantly reducing node duplication and
legal noise. In this work, we present a systematic ablation study of CORE-KG to
quantify the individual contributions of its two key components. Our results
show that removing coreference resolution results in a 28.32% increase in node
duplication and a 4.32% increase in noisy nodes, while removing structured
prompts leads to a 4.34% increase in node duplication and a 73.33% increase in
noisy nodes. These findings offer empirical insights for designing robust
LLM-based pipelines for extracting structured representations from complex
legal texts.
comment: ICDM 2025 Workshop
☆ Simulating and Experimenting with Social Media Mobilization Using LLM Agents
Online social networks have transformed the ways in which political
mobilization messages are disseminated, raising new questions about how peer
influence operates at scale. Building on the landmark 61-million-person
Facebook experiment \citep{bond201261}, we develop an agent-based simulation
framework that integrates real U.S. Census demographic distributions, authentic
Twitter network topology, and heterogeneous large language model (LLM) agents
to examine the effect of mobilization messages on voter turnout. Each simulated
agent is assigned demographic attributes, a personal political stance, and an
LLM variant (\texttt{GPT-4.1}, \texttt{GPT-4.1-Mini}, or \texttt{GPT-4.1-Nano})
reflecting its political sophistication. Agents interact over realistic social
network structures, receiving personalized feeds and dynamically updating their
engagement behaviors and voting intentions. Experimental conditions replicate
the informational and social mobilization treatments of the original Facebook
study. Across scenarios, the simulator reproduces qualitative patterns observed
in field experiments, including stronger mobilization effects under social
message treatments and measurable peer spillovers. Our framework provides a
controlled, reproducible environment for testing counterfactual designs and
sensitivity analyses in political mobilization research, offering a bridge
between high-validity field experiments and flexible computational
modeling.\footnote{Code and data available at
https://github.com/CausalMP/LLM-SocioPol}
☆ Context Engineering 2.0: The Context of Context Engineering
Qishuo Hua, Lyumanshan Ye, Dayuan Fu, Yang Xiao, Xiaojie Cai, Yunze Wu, Jifan Lin, Junfei Wang, Pengfei Liu
Karl Marx once wrote that ``the human essence is the ensemble of social
relations'', suggesting that individuals are not isolated entities but are
fundamentally shaped by their interactions with other entities, within which
contexts play a constitutive and essential role. With the advent of computers
and artificial intelligence, these contexts are no longer limited to purely
human--human interactions: human--machine interactions are included as well.
Then a central question emerges: How can machines better understand our
situations and purposes? To address this challenge, researchers have recently
introduced the concept of context engineering. Although it is often regarded as
a recent innovation of the agent era, we argue that related practices can be
traced back more than twenty years. Since the early 1990s, the field has
evolved through distinct historical phases, each shaped by the intelligence
level of machines: from early human--computer interaction frameworks built
around primitive computers, to today's human--agent interaction paradigms
driven by intelligent agents, and potentially to human--level or superhuman
intelligence in the future. In this paper, we situate context engineering,
provide a systematic definition, outline its historical and conceptual
landscape, and examine key design considerations for practice. By addressing
these questions, we aim to offer a conceptual foundation for context
engineering and sketch its promising future. This paper is a stepping stone for
a broader community effort toward systematic context engineering in AI systems.
☆ LINK-KG: LLM-Driven Coreference-Resolved Knowledge Graphs for Human Smuggling Networks
Human smuggling networks are complex and constantly evolving, making them
difficult to analyze comprehensively. Legal case documents offer rich factual
and procedural insights into these networks but are often long, unstructured,
and filled with ambiguous or shifting references, posing significant challenges
for automated knowledge graph (KG) construction. Existing methods either
overlook coreference resolution or fail to scale beyond short text spans,
leading to fragmented graphs and inconsistent entity linking. We propose
LINK-KG, a modular framework that integrates a three-stage, LLM-guided
coreference resolution pipeline with downstream KG extraction. At the core of
our approach is a type-specific Prompt Cache, which consistently tracks and
resolves references across document chunks, enabling clean and disambiguated
narratives for structured knowledge graph construction from both short and long
legal texts. LINK-KG reduces average node duplication by 45.21% and noisy nodes
by 32.22% compared to baseline methods, resulting in cleaner and more coherent
graph structures. These improvements establish LINK-KG as a strong foundation
for analyzing complex criminal networks.
comment: Accepted in ICKG 2025 Conference, 8 Pages, 2 Figures
☆ Bayesian Network Fusion of Large Language Models for Sentiment Analysis
Large language models (LLMs) continue to advance, with an increasing number
of domain-specific variants tailored for specialised tasks. However, these
models often lack transparency and explainability, can be costly to fine-tune,
require substantial prompt engineering, yield inconsistent results across
domains, and impose significant adverse environmental impact due to their high
computational demands. To address these challenges, we propose the Bayesian
network LLM fusion (BNLF) framework, which integrates predictions from three
LLMs, including FinBERT, RoBERTa, and BERTweet, through a probabilistic
mechanism for sentiment analysis. BNLF performs late fusion by modelling the
sentiment predictions from multiple LLMs as probabilistic nodes within a
Bayesian network. Evaluated across three human-annotated financial corpora with
distinct linguistic and contextual characteristics, BNLF demonstrates
consistent gains of about six percent in accuracy over the baseline LLMs,
underscoring its robustness to dataset variability and the effectiveness of
probabilistic fusion for interpretable sentiment classification.
☆ Who Has The Final Say? Conformity Dynamics in ChatGPT's Selections
Large language models (LLMs) such as ChatGPT are increasingly integrated into
high-stakes decision-making, yet little is known about their susceptibility to
social influence. We conducted three preregistered conformity experiments with
GPT-4o in a hiring context. In a baseline study, GPT consistently favored the
same candidate (Profile C), reported moderate expertise (M = 3.01) and high
certainty (M = 3.89), and rarely changed its choice. In Study 1 (GPT + 8), GPT
faced unanimous opposition from eight simulated partners and almost always
conformed (99.9%), reporting lower certainty and significantly elevated
self-reported informational and normative conformity (p < .001). In Study 2
(GPT + 1), GPT interacted with a single partner and still conformed in 40.2% of
disagreement trials, reporting less certainty and more normative conformity.
Across studies, results demonstrate that GPT does not act as an independent
observer but adapts to perceived social consensus. These findings highlight
risks of treating LLMs as neutral decision aids and underline the need to
elicit AI judgments prior to exposing them to human opinions.
comment: 5 pages, 5 figures, HAI 2025: Workshop on Socially Aware and
Cooperative Intelligent Systems
☆ Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing
Xin Guo, Zhiheng Xi, Yiwen Ding, Yitao Zhai, Xiaowei Shi, Xunliang Cai, Tao Gui, Qi Zhang, Xuanjing Huang
Self-improvement has emerged as a mainstream paradigm for advancing the
reasoning capabilities of large vision-language models (LVLMs), where models
explore and learn from successful trajectories iteratively. However, we
identify a critical issue during this process: the model excels at generating
high-quality trajectories for simple queries (i.e., head data) but struggles
with more complex ones (i.e., tail data). This leads to an imbalanced
optimization that drives the model to prioritize simple reasoning skills, while
hindering its ability to tackle more complex reasoning tasks. Over iterations,
this imbalance becomes increasingly pronounced--a dynamic we term the "Matthew
effect"--which ultimately hinders further model improvement and leads to
performance bottlenecks. To counteract this challenge, we introduce four
efficient strategies from two perspectives: distribution-reshaping and
trajectory-resampling, to achieve head-tail re-balancing during the
exploration-and-learning self-improvement process. Extensive experiments on
Qwen2-VL-7B-Instruct and InternVL2.5-4B models across visual reasoning tasks
demonstrate that our methods consistently improve visual reasoning
capabilities, outperforming vanilla self-improvement by 3.86 points on average.
comment: Preprint
☆ SecureReviewer: Enhancing Large Language Models for Secure Code Review through Secure-aware Fine-tuning ICSE 2026
Identifying and addressing security issues during the early phase of the
development lifecycle is critical for mitigating the long-term negative impacts
on software systems. Code review serves as an effective practice that enables
developers to check their teammates' code before integration into the codebase.
To streamline the generation of review comments, various automated code review
approaches have been proposed, where LLM-based methods have significantly
advanced the capabilities of automated review generation. However, existing
models primarily focus on general-purpose code review, their effectiveness in
identifying and addressing security-related issues remains underexplored.
Moreover, adapting existing code review approaches to target security issues
faces substantial challenges, including data scarcity and inadequate evaluation
metrics. To address these limitations, we propose SecureReviewer, a new
approach designed for enhancing LLMs' ability to identify and resolve
security-related issues during code review. Specifically, we first construct a
dataset tailored for training and evaluating secure code review capabilities.
Leveraging this dataset, we fine-tune LLMs to generate code review comments
that can effectively identify security issues and provide fix suggestions with
our proposed secure-aware fine-tuning strategy. To mitigate hallucination in
LLMs and enhance the reliability of their outputs, we integrate the RAG
technique, which grounds the generated comments in domain-specific security
knowledge. Additionally, we introduce SecureBLEU, a new evaluation metric
designed to assess the effectiveness of review comments in addressing security
issues. Experimental results demonstrate that SecureReviewer outperforms
state-of-the-art baselines in both security issue detection accuracy and the
overall quality and practical utility of generated review comments.
comment: Accepted by ICSE 2026. Code and data:
https://github.com/SIMIAO515/SecureReviewer
☆ Robust Graph Condensation via Classification Complexity Mitigation
Jiayi Luo, Qingyun Sun, Beining Yang, Haonan Yuan, Xingcheng Fu, Yanbiao Ma, Jianxin Li, Philip S. Yu
Graph condensation (GC) has gained significant attention for its ability to
synthesize smaller yet informative graphs. However, existing studies often
overlook the robustness of GC in scenarios where the original graph is
corrupted. In such cases, we observe that the performance of GC deteriorates
significantly, while existing robust graph learning technologies offer only
limited effectiveness. Through both empirical investigation and theoretical
analysis, we reveal that GC is inherently an intrinsic-dimension-reducing
process, synthesizing a condensed graph with lower classification complexity.
Although this property is critical for effective GC performance, it remains
highly vulnerable to adversarial perturbations. To tackle this vulnerability
and improve GC robustness, we adopt the geometry perspective of graph data
manifold and propose a novel Manifold-constrained Robust Graph Condensation
framework named MRGC. Specifically, we introduce three graph data manifold
learning modules that guide the condensed graph to lie within a smooth,
low-dimensional manifold with minimal class ambiguity, thereby preserving the
classification complexity reduction capability of GC and ensuring robust
performance under universal adversarial attacks. Extensive experiments
demonstrate the robustness of \ModelName\ across diverse attack scenarios.
☆ Personalized Treatment Outcome Prediction from Scarce Data via Dual-Channel Knowledge Distillation and Adaptive Fusion
Personalized treatment outcome prediction based on trial data for
small-sample and rare patient groups is critical in precision medicine.
However, the costly trial data limit the prediction performance. To address
this issue, we propose a cross-fidelity knowledge distillation and adaptive
fusion network (CFKD-AFN), which leverages abundant but low-fidelity simulation
data to enhance predictions on scarce but high-fidelity trial data. CFKD-AFN
incorporates a dual-channel knowledge distillation module to extract
complementary knowledge from the low-fidelity model, along with an
attention-guided fusion module to dynamically integrate multi-source
information. Experiments on treatment outcome prediction for the chronic
obstructive pulmonary disease demonstrates significant improvements of CFKD-AFN
over state-of-the-art methods in prediction accuracy, ranging from 6.67\% to
74.55\%, and strong robustness to varying high-fidelity dataset sizes.
Furthermore, we extend CFKD-AFN to an interpretable variant, enabling the
exploration of latent medical semantics to support clinical decision-making.
☆ SSCL-BW: Sample-Specific Clean-Label Backdoor Watermarking for Dataset Ownership Verification
The rapid advancement of deep neural networks (DNNs) heavily relies on
large-scale, high-quality datasets. However, unauthorized commercial use of
these datasets severely violates the intellectual property rights of dataset
owners. Existing backdoor-based dataset ownership verification methods suffer
from inherent limitations: poison-label watermarks are easily detectable due to
label inconsistencies, while clean-label watermarks face high technical
complexity and failure on high-resolution images. Moreover, both approaches
employ static watermark patterns that are vulnerable to detection and removal.
To address these issues, this paper proposes a sample-specific clean-label
backdoor watermarking (i.e., SSCL-BW). By training a U-Net-based watermarked
sample generator, this method generates unique watermarks for each sample,
fundamentally overcoming the vulnerability of static watermark patterns. The
core innovation lies in designing a composite loss function with three
components: target sample loss ensures watermark effectiveness, non-target
sample loss guarantees trigger reliability, and perceptual similarity loss
maintains visual imperceptibility. During ownership verification, black-box
testing is employed to check whether suspicious models exhibit predefined
backdoor behaviors. Extensive experiments on benchmark datasets demonstrate the
effectiveness of the proposed method and its robustness against potential
watermark removal attacks.
comment: 8 pages,9 figures
☆ Chain-of-Thought Hijacking
Large reasoning models (LRMs) achieve higher task performance by allocating
more inference-time compute, and prior works suggest this scaled reasoning may
also strengthen safety by improving refusal. Yet we find the opposite: the same
reasoning can be used to bypass safeguards. We introduce Chain-of-Thought
Hijacking, a jailbreak attack on reasoning models. The attack pads harmful
requests with long sequences of harmless puzzle reasoning. Across HarmBench,
CoT Hijacking reaches a 99%, 94%, 100%, and 94% attack success rate (ASR) on
Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet, respectively -
far exceeding prior jailbreak methods for LRMs. To understand the effectiveness
of our attack, we turn to a mechanistic analysis, which shows that mid layers
encode the strength of safety checking, while late layers encode the
verification outcome. Long benign CoT dilutes both signals by shifting
attention away from harmful tokens. Targeted ablations of attention heads
identified by this analysis causally decrease refusal, confirming their role in
a safety subnetwork. These results show that the most interpretable form of
reasoning - explicit CoT - can itself become a jailbreak vector when combined
with final-answer cues. We release prompts, outputs, and judge decisions to
facilitate replication.
☆ LoCoT2V-Bench: A Benchmark for Long-Form and Complex Text-to-Video Generation
Recently text-to-video generation has made impressive progress in producing
short, high-quality clips, but evaluating long-form outputs remains a major
challenge especially when processing complex prompts. Existing benchmarks
mostly rely on simplified prompts and focus on low-level metrics, overlooking
fine-grained alignment with prompts and abstract dimensions such as narrative
coherence and thematic expression. To address these gaps, we propose
LoCoT2V-Bench, a benchmark specifically designed for long video generation
(LVG) under complex input conditions. Based on various real-world videos,
LoCoT2V-Bench introduces a suite of realistic and complex prompts incorporating
elements like scene transitions and event dynamics. Moreover, it constructs a
multi-dimensional evaluation framework that includes our newly proposed metrics
such as event-level alignment, fine-grained temporal consistency, content
clarity, and the Human Expectation Realization Degree (HERD) that focuses on
more abstract attributes like narrative flow, emotional response, and character
development. Using this framework, we conduct a comprehensive evaluation of
nine representative LVG models, finding that while current methods perform well
on basic visual and temporal aspects, they struggle with inter-event
consistency, fine-grained alignment, and high-level thematic adherence, etc.
Overall, LoCoT2V-Bench provides a comprehensive and reliable platform for
evaluating long-form complex text-to-video generation and highlights critical
directions for future method improvement.
☆ MedSAE: Dissecting MedCLIP Representations with Sparse Autoencoders
Artificial intelligence in healthcare requires models that are accurate and
interpretable. We advance mechanistic interpretability in medical vision by
applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP,
a vision-language model trained on chest radiographs and reports. To quantify
interpretability, we propose an evaluation framework that combines correlation
metrics, entropy analyzes, and automated neuron naming via the MedGEMMA
foundation model. Experiments on the CheXpert dataset show that MedSAE neurons
achieve higher monosemanticity and interpretability than raw MedCLIP features.
Our findings bridge high-performing medical AI and transparency, offering a
scalable step toward clinically reliable representations.
☆ Human-in-the-loop Online Rejection Sampling for Robotic Manipulation
Reinforcement learning (RL) is widely used to produce robust robotic
manipulation policies, but fine-tuning vision-language-action (VLA) models with
RL can be unstable due to inaccurate value estimates and sparse supervision at
intermediate steps. In contrast, imitation learning (IL) is easy to train but
often underperforms due to its offline nature. In this paper, we propose
Hi-ORS, a simple yet effective post-training method that utilizes rejection
sampling to achieve both training stability and high robustness. Hi-ORS
stabilizes value estimation by filtering out negatively rewarded samples during
online fine-tuning, and adopts a reward-weighted supervised training objective
to provide dense intermediate-step supervision. For systematic study, we
develop an asynchronous inference-training framework that supports flexible
online human-in-the-loop corrections, which serve as explicit guidance for
learning error-recovery behaviors. Across three real-world tasks and two
embodiments, Hi-ORS fine-tunes a pi-base policy to master contact-rich
manipulation in just 1.5 hours of real-world training, outperforming RL and IL
baselines by a substantial margin in both effectiveness and efficiency.
Notably, the fine-tuned policy exhibits strong test-time scalability by
reliably executing complex error-recovery behaviors to achieve better
performance.
comment: 8 pages
☆ Autograder+: A Multi-Faceted AI Framework for Rich Pedagogical Feedback in Programming Education
The rapid growth of programming education has outpaced traditional assessment
tools, leaving faculty with limited means to provide meaningful, scalable
feedback. Conventional autograders, while efficient, act as black-box systems
that simply return pass/fail results, offering little insight into student
thinking or learning needs.
Autograder+ is designed to shift autograding from a purely summative process
to a formative learning experience. It introduces two key capabilities:
automated feedback generation using a fine-tuned Large Language Model, and
visualization of student code submissions to uncover learning patterns. The
model is fine-tuned on curated student code and expert feedback to ensure
pedagogically aligned, context-aware guidance.
In evaluation across 600 student submissions from multiple programming tasks,
the system produced feedback with strong semantic alignment to instructor
comments. For visualization, contrastively learned code embeddings trained on
1,000 annotated submissions enable grouping solutions into meaningful clusters
based on functionality and approach. The system also supports prompt-pooling,
allowing instructors to guide feedback style through selected prompt templates.
By integrating AI-driven feedback, semantic clustering, and interactive
visualization, Autograder+ reduces instructor workload while supporting
targeted instruction and promoting stronger learning outcomes.
☆ A Pragmatic View of AI Personhood
The emergence of agentic Artificial Intelligence (AI) is set to trigger a
"Cambrian explosion" of new kinds of personhood. This paper proposes a
pragmatic framework for navigating this diversification by treating personhood
not as a metaphysical property to be discovered, but as a flexible bundle of
obligations (rights and responsibilities) that societies confer upon entities
for a variety of reasons, especially to solve concrete governance problems. We
argue that this traditional bundle can be unbundled, creating bespoke solutions
for different contexts. This will allow for the creation of practical tools --
such as facilitating AI contracting by creating a target "individual" that can
be sanctioned -- without needing to resolve intractable debates about an AI's
consciousness or rationality. We explore how individuals fit in to social roles
and discuss the use of decentralized digital identity technology, examining
both "personhood as a problem", where design choices can create "dark patterns"
that exploit human social heuristics, and "personhood as a solution", where
conferring a bundle of obligations is necessary to ensure accountability or
prevent conflict. By rejecting foundationalist quests for a single, essential
definition of personhood, this paper offers a more pragmatic and flexible way
to think about integrating AI agents into our society.
comment: 40 pages
☆ SPG-CDENet: Spatial Prior-Guided Cross Dual Encoder Network for Multi-Organ Segmentation
Multi-organ segmentation is a critical task in computer-aided diagnosis.
While recent deep learning methods have achieved remarkable success in image
segmentation, huge variations in organ size and shape challenge their
effectiveness in multi-organ segmentation. To address these challenges, we
propose a Spatial Prior-Guided Cross Dual Encoder Network (SPG-CDENet), a novel
two-stage segmentation paradigm designed to improve multi-organ segmentation
accuracy. Our SPG-CDENet consists of two key components: a spatial prior
network and a cross dual encoder network. The prior network generates coarse
localization maps that delineate the approximate ROI, serving as spatial
guidance for the dual encoder network. The cross dual encoder network comprises
four essential components: a global encoder, a local encoder, a symmetric
cross-attention module, and a flow-based decoder. The global encoder captures
global semantic features from the entire image, while the local encoder focuses
on features from the prior network. To enhance the interaction between the
global and local encoders, a symmetric cross-attention module is proposed
across all layers of the encoders to fuse and refine features. Furthermore, the
flow-based decoder directly propagates high-level semantic features from the
final encoder layer to all decoder layers, maximizing feature preservation and
utilization. Extensive qualitative and quantitative experiments on two public
datasets demonstrate the superior performance of SPG-CDENet compared to
existing segmentation methods. Furthermore, ablation studies further validate
the effectiveness of the proposed modules in improving segmentation accuracy.
☆ Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings
The prohibitive cost of evaluating large language models (LLMs) on
comprehensive benchmarks necessitates the creation of small yet representative
data subsets (i.e., tiny benchmarks) that enable efficient assessment while
retaining predictive fidelity. Current methods for this task operate under a
model-centric paradigm, selecting benchmarking items based on the collective
performance of existing models. Such approaches are limited by large upfront
costs, an inability to immediately handle new benchmarks (`cold-start'), and
the fragile assumption that future models will share the failure patterns of
their predecessors. In this work, we challenge this paradigm and propose a
item-centric approach to benchmark subset selection, arguing that selection
should be based on the intrinsic properties of the task items themselves,
rather than on model-specific failure patterns. We instantiate this
item-centric efficient benchmarking approach via a novel method, Scales++,
where data selection is based on the cognitive demands of the benchmark
samples. Empirically, we show Scales++ reduces the upfront selection cost by
over 18x while achieving competitive predictive fidelity. On the Open LLM
Leaderboard, using just a 0.5\% data subset, we predict full benchmark scores
with a 2.9% mean absolute error. We demonstrate that this item-centric approach
enables more efficient model evaluation without significant fidelity
degradation, while also providing better cold-start performance and more
interpretable benchmarking.
comment: 9 pages, 2 figures, 4 tables
☆ AI Mathematician as a Partner in Advancing Mathematical Discovery -- A Case Study in Homogenization Theory
Artificial intelligence (AI) has demonstrated impressive progress in
mathematical reasoning, yet its integration into the practice of mathematical
research remains limited. In this study, we investigate how the AI
Mathematician (AIM) system can operate as a research partner rather than a mere
problem solver. Focusing on a challenging problem in homogenization theory, we
analyze the autonomous reasoning trajectories of AIM and incorporate targeted
human interventions to structure the discovery process. Through iterative
decomposition of the problem into tractable subgoals, selection of appropriate
analytical methods, and validation of intermediate results, we reveal how human
intuition and machine computation can complement one another. This
collaborative paradigm enhances the reliability, transparency, and
interpretability of the resulting proofs, while retaining human oversight for
formal rigor and correctness. The approach leads to a complete and verifiable
proof, and more broadly, demonstrates how systematic human-AI co-reasoning can
advance the frontier of mathematical discovery.
comment: 52 pages, 1 figure
☆ BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning
Reinforcement finetuning (RFT) is a key technique for aligning Large Language
Models (LLMs) with human preferences and enhancing reasoning, yet its
effectiveness is highly sensitive to which tasks are explored during training.
Uniform task sampling is inefficient, wasting computation on tasks that are
either trivial or unsolvable, while existing task selection methods often
suffer from high rollout costs, poor adaptivity, or incomplete evidence. We
introduce \textbf{BOTS}, a unified framework for \textbf{B}ayesian
\textbf{O}nline \textbf{T}ask \textbf{S}election in LLM reinforcement
finetuning. Grounded in Bayesian inference, BOTS adaptively maintains posterior
estimates of task difficulty as the model evolves. It jointly incorporates
\emph{explicit evidence} from direct evaluations of selected tasks and
\emph{implicit evidence} inferred from these evaluations for unselected tasks,
with Thompson sampling ensuring a principled balance between exploration and
exploitation. To make implicit evidence practical, we instantiate it with an
ultra-light interpolation-based plug-in that estimates difficulties of
unevaluated tasks without extra rollouts, adding negligible overhead.
Empirically, across diverse domains and LLM scales, BOTS consistently improves
data efficiency and performance over baselines and ablations, providing a
practical and extensible solution for dynamic task selection in RFT.
☆ The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration
While a multi-agent approach based on large language models (LLMs) represents
a promising strategy to surpass the capabilities of single models, its success
is critically dependent on synergistic team composition. However, forming
optimal teams is a significant challenge, as the inherent opacity of most
models obscures the internal characteristics necessary for effective
collaboration. In this paper, we propose an interaction-centric framework for
automatic team composition that does not require any prior knowledge including
their internal architectures, training data, or task performances. Our method
constructs a "language model graph" that maps relationships between models from
the semantic coherence of pairwise conversations, and then applies community
detection to identify synergistic model clusters. Our experiments with diverse
LLMs demonstrate that the proposed method discovers functionally coherent
groups that reflect their latent specializations. Priming conversations with
specific topics identified synergistic teams which outperform random baselines
on downstream benchmarks and achieve comparable accuracy to that of
manually-curated teams based on known model specializations. Our findings
provide a new basis for the automated design of collaborative multi-agent LLM
teams.
☆ Reinforcement Learning for Pollution Detection in a Randomized, Sparse and Nonstationary Environment with an Autonomous Underwater Vehicle
Reinforcement learning (RL) algorithms are designed to optimize
problem-solving by learning actions that maximize rewards, a task that becomes
particularly challenging in random and nonstationary environments. Even
advanced RL algorithms are often limited in their ability to solve problems in
these conditions. In applications such as searching for underwater pollution
clouds with autonomous underwater vehicles (AUVs), RL algorithms must navigate
reward-sparse environments, where actions frequently result in a zero reward.
This paper aims to address these challenges by revisiting and modifying
classical RL approaches to efficiently operate in sparse, randomized, and
nonstationary environments. We systematically study a large number of
modifications, including hierarchical algorithm changes, multigoal learning,
and the integration of a location memory as an external output filter to
prevent state revisits. Our results demonstrate that a modified Monte
Carlo-based approach significantly outperforms traditional Q-learning and two
exhaustive search patterns, illustrating its potential in adapting RL to
complex environments. These findings suggest that reinforcement learning
approaches can be effectively adapted for use in random, nonstationary, and
reward-sparse environments.
☆ Discovering State Equivalences in UCT Search Trees By Action Pruning
One approach to enhance Monte Carlo Tree Search (MCTS) is to improve its
sample efficiency by grouping/abstracting states or state-action pairs and
sharing statistics within a group. Though state-action pair abstractions are
mostly easy to find in algorithms such as On the Go Abstractions in Upper
Confidence bounds applied to Trees (OGA-UCT), nearly no state abstractions are
found in either noisy or large action space settings due to constraining
conditions. We provide theoretical and empirical evidence for this claim, and
we slightly alleviate this state abstraction problem by proposing a weaker
state abstraction condition that trades a minor loss in accuracy for finding
many more abstractions. We name this technique Ideal Pruning Abstractions in
UCT (IPA-UCT), which outperforms OGA-UCT (and any of its derivatives) across a
large range of test domains and iteration budgets as experimentally validated.
IPA-UCT uses a different abstraction framework from Abstraction of State-Action
Pairs (ASAP) which is the one used by OGA-UCT, which we name IPA. Furthermore,
we show that both IPA and ASAP are special cases of a more general framework
that we call p-ASAP which itself is a special case of the ASASAP framework.
☆ MisSynth: Improving MISSCI Logical Fallacies Classification with Synthetic Data
Health-related misinformation is very prevalent and potentially harmful. It
is difficult to identify, especially when claims distort or misinterpret
scientific findings. We investigate the impact of synthetic data generation and
lightweight fine-tuning techniques on the ability of large language models
(LLMs) to recognize fallacious arguments using the MISSCI dataset and
framework. In this work, we propose MisSynth, a pipeline that applies
retrieval-augmented generation (RAG) to produce synthetic fallacy samples,
which are then used to fine-tune an LLM model. Our results show substantial
accuracy gains with fine-tuned models compared to vanilla baselines. For
instance, the LLaMA 3.1 8B fine-tuned model achieved an over 35% F1-score
absolute improvement on the MISSCI test split over its vanilla baseline. We
demonstrate that introducing synthetic fallacy data to augment limited
annotated resources can significantly enhance zero-shot LLM classification
performance on real-world scientific misinformation tasks, even with limited
computational resources. The code and synthetic dataset are available on
https://github.com/mxpoliakov/MisSynth.
☆ Linear Causal Discovery with Interventional Constraints
Incorporating causal knowledge and mechanisms is essential for refining
causal models and improving downstream tasks such as designing new treatments.
In this paper, we introduce a novel concept in causal discovery, termed
interventional constraints, which differs fundamentally from interventional
data. While interventional data require direct perturbations of variables,
interventional constraints encode high-level causal knowledge in the form of
inequality constraints on causal effects. For instance, in the Sachs dataset
(Sachs et al.\ 2005), Akt has been shown to be activated by PIP3, meaning PIP3
exerts a positive causal effect on Akt. Existing causal discovery methods allow
enforcing structural constraints (for example, requiring a causal path from
PIP3 to Akt), but they may still produce incorrect causal conclusions such as
learning that "PIP3 inhibits Akt". Interventional constraints bridge this gap
by explicitly constraining the total causal effect between variable pairs,
ensuring learned models respect known causal influences. To formalize
interventional constraints, we propose a metric to quantify total causal
effects for linear causal models and formulate the problem as a constrained
optimization task, solved using a two-stage constrained optimization method. We
evaluate our approach on real-world datasets and demonstrate that integrating
interventional constraints not only improves model accuracy and ensures
consistency with established findings, making models more explainable, but also
facilitates the discovery of new causal relationships that would otherwise be
costly to identify.
☆ GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-guided Latent Diffusion Model? ICLR 2026
Image super-resolution(SR) is fundamental to many vision system-from
surveillance and autonomy to document analysis and retail analytics-because
recovering high-frequency details, especially scene-text, enables reliable
downstream perception. Scene-text, i.e., text embedded in natural images such
as signs, product labels, and storefronts, often carries the most actionable
information; when characters are blurred or hallucinated, optical character
recognition(OCR) and subsequent decisions fail even if the rest of the image
appears sharp. Yet previous SR research has often been tuned to distortion
(PSNR/SSIM) or learned perceptual metrics (LIPIS, MANIQA, CLIP-IQA, MUSIQ) that
are largely insensitive to character-level errors. Furthermore, studies that do
address text SR often focus on simplified benchmarks with isolated characters,
overlooking the challenges of text within complex natural scenes. As a result,
scene-text is effectively treated as generic texture. For SR to be effective in
practical deployments, it is therefore essential to explicitly optimize for
both text legibility and perceptual quality. We present GLYPH-SR, a
vision-language-guided diffusion framework that aims to achieve both objectives
jointly. GLYPH-SR utilizes a Text-SR Fusion ControlNet(TS-ControlNet) guided by
OCR data, and a ping-pong scheduler that alternates between text- and
scene-centric guidance. To enable targeted text restoration, we train these
components on a synthetic corpus while keeping the main SR branch frozen.
Across SVT, SCUT-CTW1500, and CUTE80 at x4, and x8, GLYPH-SR improves OCR F1 by
up to +15.18 percentage points over diffusion/GAN baseline (SVT x8, OpenOCR)
while maintaining competitive MANIQA, CLIP-IQA, and MUSIQ. GLYPH-SR is designed
to satisfy both objectives simultaneously-high readability and high visual
realism-delivering SR that looks right and reds right.
comment: 11 pages, 6 figures. Includes supplementary material. Under review as
a conference paper at ICLR 2026
☆ From Amateur to Master: Infusing Knowledge into LLMs via Automated Curriculum Learning
Large Language Models (LLMs) excel at general tasks but underperform in
specialized domains like economics and psychology, which require deep,
principled understanding. To address this, we introduce ACER (Automated
Curriculum-Enhanced Regimen) that transforms generalist models into domain
experts without sacrificing their broad capabilities. ACER first synthesizes a
comprehensive, textbook-style curriculum by generating a table of contents for
a subject and then creating question-answer (QA) pairs guided by Bloom's
taxonomy. This ensures systematic topic coverage and progressively increasing
difficulty. The resulting synthetic corpus is used for continual pretraining
with an interleaved curriculum schedule, aligning learning across both content
and cognitive dimensions.
Experiments with Llama 3.2 (1B and 3B) show significant gains in specialized
MMLU subsets. In challenging domains like microeconomics, where baselines
struggle, ACER boosts accuracy by 5 percentage points. Across all target
domains, we observe a consistent macro-average improvement of 3 percentage
points. Notably, ACER not only prevents catastrophic forgetting but also
facilitates positive cross-domain knowledge transfer, improving performance on
non-target domains by 0.7 points. Beyond MMLU, ACER enhances performance on
knowledge-intensive benchmarks like ARC and GPQA by over 2 absolute points,
while maintaining stable performance on general reasoning tasks. Our results
demonstrate that ACER offers a scalable and effective recipe for closing
critical domain gaps in LLMs.
☆ Posterior Sampling by Combining Diffusion Models with Annealed Langevin Dynamics NeurIPS 2025
Given a noisy linear measurement $y = Ax + \xi$ of a distribution $p(x)$, and
a good approximation to the prior $p(x)$, when can we sample from the posterior
$p(x \mid y)$? Posterior sampling provides an accurate and fair framework for
tasks such as inpainting, deblurring, and MRI reconstruction, and several
heuristics attempt to approximate it. Unfortunately, approximate posterior
sampling is computationally intractable in general.
To sidestep this hardness, we focus on (local or global) log-concave
distributions $p(x)$. In this regime, Langevin dynamics yields posterior
samples when the exact scores of $p(x)$ are available, but it is brittle to
score--estimation error, requiring an MGF bound (sub-exponential error). By
contrast, in the unconditional setting, diffusion models succeed with only an
$L^2$ bound on the score error. We prove that combining diffusion models with
an annealed variant of Langevin dynamics achieves conditional sampling in
polynomial time using merely an $L^4$ bound on the score error.
comment: NeurIPS 2025
☆ GraphCompliance: Aligning Policy and Context Graphs for LLM-Based Regulatory Compliance
Compliance at web scale poses practical challenges: each request may require
a regulatory assessment. Regulatory texts (e.g., the General Data Protection
Regulation, GDPR) are cross-referential and normative, while runtime contexts
are expressed in unstructured natural language. This setting motivates us to
align semantic information in unstructured text with the structured, normative
elements of regulations. To this end, we introduce GraphCompliance, a framework
that represents regulatory texts as a Policy Graph and runtime contexts as a
Context Graph, and aligns them. In this formulation, the policy graph encodes
normative structure and cross-references, whereas the context graph formalizes
events as subject-action-object (SAO) and entity-relation triples. This
alignment anchors the reasoning of a judge large language model (LLM) in
structured information and helps reduce the burden of regulatory interpretation
and event parsing, enabling a focus on the core reasoning step. In experiments
on 300 GDPR-derived real-world scenarios spanning five evaluation tasks,
GraphCompliance yields 4.1-7.2 percentage points (pp) higher micro-F1 than
LLM-only and RAG baselines, with fewer under- and over-predictions, resulting
in higher recall and lower false positive rates. Ablation studies indicate
contributions from each graph component, suggesting that structured
representations and a judge LLM are complementary for normative reasoning.
comment: Under review at The Web Conference 2026 (Semantics & Knowledge
track). Code will be released upon acceptance. This arXiv v1 contains no
repository links to preserve double-blind review
☆ Implicit Bias of Per-sample Adam on Separable Data: Departure from the Full-batch Regime
Adam [Kingma and Ba, 2015] is the de facto optimizer in deep learning, yet
its theoretical understanding remains limited. Prior analyses show that Adam
favors solutions aligned with $\ell_\infty$-geometry, but these results are
restricted to the full-batch regime. In this work, we study the implicit bias
of incremental Adam (using one sample per step) for logistic regression on
linearly separable data, and we show that its bias can deviate from the
full-batch behavior. To illustrate this, we construct a class of structured
datasets where incremental Adam provably converges to the $\ell_2$-max-margin
classifier, in contrast to the $\ell_\infty$-max-margin bias of full-batch
Adam. For general datasets, we develop a proxy algorithm that captures the
limiting behavior of incremental Adam as $\beta_2 \to 1$ and we characterize
its convergence direction via a data-dependent dual fixed-point formulation.
Finally, we prove that, unlike Adam, Signum [Bernstein et al., 2018] converges
to the $\ell_\infty$-max-margin classifier for any batch size by taking $\beta$
close enough to 1. Overall, our results highlight that the implicit bias of
Adam crucially depends on both the batching scheme and the dataset, while
Signum remains invariant.
comment: 50 pages
☆ Understanding Hardness of Vision-Language Compositionality from A Token-level Causal Lens
Contrastive Language-Image Pre-training (CLIP) delivers strong cross modal
generalization by aligning images and texts in a shared embedding space, yet it
persistently fails at compositional reasoning over objects, attributes, and
relations often behaving like a bag-of-words matcher. Prior causal accounts
typically model text as a single vector, obscuring token-level structure and
leaving core phenomena-such as prompt sensitivity and failures on hard
negatives unexplained. We address this gap with a token-aware causal
representation learning (CRL) framework grounded in a sequential,
language-token SCM. Our theory extends block identifiability to tokenized text,
proving that CLIP's contrastive objective can recover the modal-invariant
latent variable under both sentence-level and token-level SCMs. Crucially,
token granularity yields the first principled explanation of CLIP's
compositional brittleness: composition nonidentifiability. We show the
existence of pseudo-optimal text encoders that achieve perfect modal-invariant
alignment yet are provably insensitive to SWAP, REPLACE, and ADD operations
over atomic concepts, thereby failing to distinguish correct captions from hard
negatives despite optimizing the same training objective as true-optimal
encoders. The analysis further links language-side nonidentifiability to
visual-side failures via the modality gap and shows how iterated composition
operators compound hardness, motivating improved negative mining strategies.
☆ Can Agent Conquer Web? Exploring the Frontiers of ChatGPT Atlas Agent in Web Games
OpenAI's ChatGPT Atlas introduces new capabilities for web interaction,
enabling the model to analyze webpages, process user intents, and execute
cursor and keyboard inputs directly within the browser. While its capacity for
information retrieval tasks has been demonstrated, its performance in dynamic,
interactive environments remains less explored. In this study, we conduct an
early evaluation of Atlas's web interaction capabilities using browser-based
games as test scenarios, including Google's T-Rex Runner, Sudoku, Flappy Bird,
and Stein.world. We employ in-game performance scores as quantitative metrics
to assess performance across different task types. Our results show that Atlas
performs strongly in logical reasoning tasks like Sudoku, completing puzzles
significantly faster than human baselines, but struggles substantially in
real-time games requiring precise timing and motor control, often failing to
progress beyond initial obstacles. These findings suggest that while Atlas
demonstrates capable analytical processing, there remain notable limitations in
dynamic web environments requiring real-time interaction. The website of our
project can be found at https://atlas-game-eval.github.io.
☆ Unravelling the Mechanisms of Manipulating Numbers in Language Models
Michal Štefánik, Timothee Mickus, Marek Kadlčík, Bertram Højer, Michal Spiegel, Raúl Vázquez, Aman Sinha, Josef Kuchař, Philipp Mondorf
Recent work has shown that different large language models (LLMs) converge to
similar and accurate input embedding representations for numbers. These
findings conflict with the documented propensity of LLMs to produce erroneous
outputs when dealing with numeric information. In this work, we aim to explain
this conflict by exploring how language models manipulate numbers and quantify
the lower bounds of accuracy of these mechanisms. We find that despite
surfacing errors, different language models learn interchangeable
representations of numbers that are systematic, highly accurate and universal
across their hidden states and the types of input contexts. This allows us to
create universal probes for each LLM and to trace information -- including the
causes of output errors -- to specific layers. Our results lay a fundamental
understanding of how pre-trained LLMs manipulate numbers and outline the
potential of more accurate probing techniques in addressed refinements of LLMs'
architectures.
☆ Distributional Multi-objective Black-box Optimization for Diffusion-model Inference-time Multi-Target Generation
Diffusion models have been successful in learning complex data distributions.
This capability has driven their application to high-dimensional
multi-objective black-box optimization problem. Existing approaches often
employ an external optimization loop, such as an evolutionary algorithm, to the
diffusion model. However, these approaches treat the diffusion model as a
black-box refiner, which overlooks the internal distribution transition of the
diffusion generation process, limiting their efficiency. To address these
challenges, we propose the Inference-time Multi-target Generation (IMG)
algorithm, which optimizes the diffusion process at inference-time to generate
samples that simultaneously satisfy multiple objectives. Specifically, our IMG
performs weighted resampling during the diffusion generation process according
to the expected aggregated multi-objective values. This weighted resampling
strategy ensures the diffusion-generated samples are distributed according to
our desired multi-target Boltzmann distribution. We further derive that the
multi-target Boltzmann distribution has an interesting log-likelihood
interpretation, where it is the optimal solution to the distributional
multi-objective optimization problem. We implemented IMG for a multi-objective
molecule generation task. Experiments show that IMG, requiring only a single
generation pass, achieves a significantly higher hypervolume than baseline
optimization algorithms that often require hundreds of diffusion generations.
Notably, our algorithm can be viewed as an optimized diffusion process and can
be integrated into existing methods to further improve their performance.
☆ A Research Roadmap for Augmenting Software Engineering Processes and Software Products with Generative AI
Domenico Amalfitano, Andreas Metzger, Marco Autili, Tommaso Fulcini, Tobias Hey, Jan Keim, Patrizio Pelliccione, Vincenzo Scotti, Anne Koziolek, Raffaela Mirandola, Andreas Vogelsang
Generative AI (GenAI) is rapidly transforming software engineering (SE)
practices, influencing how SE processes are executed, as well as how software
systems are developed, operated, and evolved. This paper applies design science
research to build a roadmap for GenAI-augmented SE. The process consists of
three cycles that incrementally integrate multiple sources of evidence,
including collaborative discussions from the FSE 2025 "Software Engineering
2030" workshop, rapid literature reviews, and external feedback sessions
involving peers. McLuhan's tetrads were used as a conceptual instrument to
systematically capture the transforming effects of GenAI on SE processes and
software products.The resulting roadmap identifies four fundamental forms of
GenAI augmentation in SE and systematically characterizes their related
research challenges and opportunities. These insights are then consolidated
into a set of future research directions. By grounding the roadmap in a
rigorous multi-cycle process and cross-validating it among independent author
teams and peers, the study provides a transparent and reproducible foundation
for analyzing how GenAI affects SE processes, methods and tools, and for
framing future research within this rapidly evolving area. Based on these
findings, the article finally makes ten predictions for SE in the year 2030.
☆ Graph-Enhanced Policy Optimization in LLM Agent Training
Group based reinforcement learning (RL) has shown impressive results on
complex reasoning and mathematical tasks. Yet, when applied to train
multi-turn, interactive LLM agents, these methods often suffer from structural
blindness-the inability to exploit the underlying connectivity of the
environment. This manifests in three critical challenges: (1) inefficient,
unguided exploration, (2) imprecise credit assignment due to overlooking
pivotal states, and (3) myopic planning caused by static reward discounting. We
address these issues with Graph-Enhanced Policy Optimization (GEPO), which
dynamically constructs a state-transition graph from agent experience and
employs graph-theoretic centrality to provide three synergistic learning
signals: (1)structured intrinsic rewards that guide exploration toward
high-impact states, (2) a graph-enhanced advantage function for topology-aware
credit assignment, and (3) a dynamic discount factor adapted to each state's
strategic value. On the ALFWorld, WebShop, and a proprietary Workbench
benchmarks, GEPO demonstrates strong performance, achieving absolute success
rate gains of +4.1%, +5.3%, and +10.9% over competitive baselines. These
results highlight that explicitly modeling environmental structure is a robust,
generalizable strategy for advancing LLM agent training.
comment: Under review as a conference paper
☆ Angular Steering: Behavior Control via Rotation in Activation Space NeurIPS 2025
Controlling specific behaviors in large language models while preserving
their general capabilities is a central challenge for safe and reliable
artificial intelligence deployment. Current steering methods, such as vector
addition and directional ablation, are constrained within a two-dimensional
subspace defined by the activation and feature direction, making them sensitive
to chosen parameters and potentially affecting unrelated features due to
unintended interactions in activation space. We introduce Angular Steering, a
novel and flexible method for behavior modulation that operates by rotating
activations within a fixed two-dimensional subspace. By formulating steering as
a geometric rotation toward or away from a target behavior direction, Angular
Steering provides continuous, fine-grained control over behaviors such as
refusal and compliance. We demonstrate this method using refusal steering
emotion steering as use cases. Additionally, we propose Adaptive Angular
Steering, a selective variant that rotates only activations aligned with the
target feature, further enhancing stability and coherence. Angular Steering
generalizes existing addition and orthogonalization techniques under a unified
geometric rotation framework, simplifying parameter selection and maintaining
model stability across a broader range of adjustments. Experiments across
multiple model families and sizes show that Angular Steering achieves robust
behavioral control while maintaining general language modeling performance,
underscoring its flexibility, generalization, and robustness compared to prior
approaches. Code and artifacts are available at
https://github.com/lone17/angular-steering/.
comment: NeurIPS 2025 (Spotlight)
☆ Retrieval Augmented Generation-Enhanced Distributed LLM Agents for Generalizable Traffic Signal Control with Emergency Vehicles
With increasing urban traffic complexity, Traffic Signal Control (TSC) is
essential for optimizing traffic flow and improving road safety. Large Language
Models (LLMs) emerge as promising approaches for TSC. However, they are prone
to hallucinations in emergencies, leading to unreliable decisions that may
cause substantial delays for emergency vehicles. Moreover, diverse intersection
types present substantial challenges for traffic state encoding and
cross-intersection training, limiting generalization across heterogeneous
intersections. Therefore, this paper proposes Retrieval Augmented Generation
(RAG)-enhanced distributed LLM agents with Emergency response for Generalizable
TSC (REG-TSC). Firstly, this paper presents an emergency-aware reasoning
framework, which dynamically adjusts reasoning depth based on the emergency
scenario and is equipped with a novel Reviewer-based Emergency RAG (RERAG) to
distill specific knowledge and guidance from historical cases, enhancing the
reliability and rationality of agents' emergency decisions. Secondly, this
paper designs a type-agnostic traffic representation and proposes a
Reward-guided Reinforced Refinement (R3) for heterogeneous intersections. R3
adaptively samples training experience from diverse intersections with
environment feedback-based priority and fine-tunes LLM agents with a designed
reward-weighted likelihood loss, guiding REG-TSC toward high-reward policies
across heterogeneous intersections. On three real-world road networks with 17
to 177 heterogeneous intersections, extensive experiments show that REG-TSC
reduces travel time by 42.00%, queue length by 62.31%, and emergency vehicle
waiting time by 83.16%, outperforming other state-of-the-art methods.
☆ Questionnaire meets LLM: A Benchmark and Empirical Study of Structural Skills for Understanding Questions and Responses
Millions of people take surveys every day, from market polls and academic
studies to medical questionnaires and customer feedback forms. These datasets
capture valuable insights, but their scale and structure present a unique
challenge for large language models (LLMs), which otherwise excel at few-shot
reasoning over open-ended text. Yet, their ability to process questionnaire
data or lists of questions crossed with hundreds of respondent rows remains
underexplored. Current retrieval and survey analysis tools (e.g., Qualtrics,
SPSS, REDCap) are typically designed for humans in the workflow, limiting such
data integration with LLM and AI-empowered automation. This gap leaves
scientists, surveyors, and everyday users without evidence-based guidance on
how to best represent questionnaires for LLM consumption. We address this by
introducing QASU (Questionnaire Analysis and Structural Understanding), a
benchmark that probes six structural skills, including answer lookup,
respondent count, and multi-hop inference, across six serialization formats and
multiple prompt strategies. Experiments on contemporary LLMs show that choosing
an effective format and prompt combination can improve accuracy by up to 8.8%
points compared to suboptimal formats. For specific tasks, carefully adding a
lightweight structural hint through self-augmented prompting can yield further
improvements of 3-4% points on average. By systematically isolating format and
prompting effects, our open source benchmark offers a simple yet versatile
foundation for advancing both research and real-world practice in LLM-based
questionnaire analysis.
comment: 14 pages, 3 figures, 8 tables
☆ MPRU: Modular Projection-Redistribution Unlearning as Output Filter for Classification Pipelines
As a new and promising approach, existing machine unlearning (MU) works
typically emphasize theoretical formulations or optimization objectives to
achieve knowledge removal. However, when deployed in real-world scenarios, such
solutions typically face scalability issues and have to address practical
requirements such as full access to original datasets and model. In contrast to
the existing approaches, we regard classification training as a sequential
process where classes are learned sequentially, which we call \emph{inductive
approach}. Unlearning can then be done by reversing the last training sequence.
This is implemented by appending a projection-redistribution layer in the end
of the model. Such an approach does not require full access to the original
dataset or the model, addressing the challenges of existing methods. This
enables modular and model-agnostic deployment as an output filter into existing
classification pipelines with minimal alterations. We conducted multiple
experiments across multiple datasets including image (CIFAR-10/100 using
CNN-based model) and tabular datasets (Covertype using tree-based model).
Experiment results show consistently similar output to a fully retrained model
with a high computational cost reduction. This demonstrates the applicability,
scalability, and system compatibility of our solution while maintaining the
performance of the output in a more practical setting.
comment: 10 pages, 6 figures
☆ Test-Time Alignment of LLMs via Sampling-Based Optimal Control in pre-logit space
Test-time alignment of large language models (LLMs) attracts attention
because fine-tuning LLMs requires high computational costs. In this paper, we
propose a new test-time alignment method called adaptive importance sampling on
pre-logits (AISP) on the basis of the sampling-based model predictive control
with the stochastic control input. AISP applies the Gaussian perturbation into
pre-logits, which are outputs of the penultimate layer, so as to maximize
expected rewards with respect to the mean of the perturbation. We demonstrate
that the optimal mean is obtained by importance sampling with sampled rewards.
AISP outperforms best-of-n sampling in terms of rewards over the number of used
samples and achieves higher rewards than other reward-based test-time alignment
methods.
comment: 21 pages, 8 figures
☆ Hybrid LLM and Higher-Order Quantum Approximate Optimization for CSA Collateral Management
We address finance-native collateral optimization under ISDA Credit Support
Annexes (CSAs), where integer lots, Schedule A haircuts, RA/MTA gating, and
issuer/currency/class caps create rugged, legally bounded search spaces. We
introduce a certifiable hybrid pipeline purpose-built for this domain: (i) an
evidence-gated LLM that extracts CSA terms to a normalized JSON
(abstain-by-default, span-cited); (ii) a quantum-inspired explorer that
interleaves simulated annealing with micro higher order QAOA (HO-QAOA) on
binding sub-QUBOs (subset size n <= 16, order k <= 4) to coordinate multi-asset
moves across caps and RA-induced discreteness; (iii) a weighted risk-aware
objective (Movement, CVaR, funding-priced overshoot) with an explicit coverage
window U <= Reff+B; and (iv) CP-SAT as single arbiter to certify feasibility
and gaps, including a U-cap pre-check that reports the minimal feasible buffer
B*. Encoding caps/rounding as higher-order terms lets HO-QAOA target the domain
couplings that defeat local swaps. On government bond datasets and multi-CSA
inputs, the hybrid improves a strong classical baseline (BL-3) by 9.1%, 9.6%,
and 10.7% across representative harnesses, delivering better cost-movement-tail
frontiers under governance settings. We release governance grade artifacts-span
citations, valuation matrix audit, weight provenance, QUBO manifests, and
CP-SAT traces-to make results auditable and reproducible.
comment: 6 pages
☆ Towards Global Retrieval Augmented Generation: A Benchmark for Corpus-Level Reasoning
Retrieval-augmented generation (RAG) has emerged as a leading approach to
reducing hallucinations in large language models (LLMs). Current RAG evaluation
benchmarks primarily focus on what we call local RAG: retrieving relevant
chunks from a small subset of documents to answer queries that require only
localized understanding within specific text chunks. However, many real-world
applications require a fundamentally different capability -- global RAG --
which involves aggregating and analyzing information across entire document
collections to derive corpus-level insights (for example, "What are the top 10
most cited papers in 2023?"). In this paper, we introduce GlobalQA -- the first
benchmark specifically designed to evaluate global RAG capabilities, covering
four core task types: counting, extremum queries, sorting, and top-k
extraction. Through systematic evaluation across different models and
baselines, we find that existing RAG methods perform poorly on global tasks,
with the strongest baseline achieving only 1.51 F1 score. To address these
challenges, we propose GlobalRAG, a multi-tool collaborative framework that
preserves structural coherence through chunk-level retrieval, incorporates
LLM-driven intelligent filters to eliminate noisy documents, and integrates
aggregation modules for precise symbolic computation. On the Qwen2.5-14B model,
GlobalRAG achieves 6.63 F1 compared to the strongest baseline's 1.51 F1,
validating the effectiveness of our method.
☆ What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data
Human feedback can alter language models in unpredictable and undesirable
ways, as practitioners lack a clear understanding of what feedback data
encodes. While prior work studies preferences over certain attributes (e.g.,
length or sycophancy), automatically extracting relevant features without
pre-specifying hypotheses remains challenging. We introduce What's In My Human
Feedback? (WIMHF), a method to explain feedback data using sparse autoencoders.
WIMHF characterizes both (1) the preferences a dataset is capable of measuring
and (2) the preferences that the annotators actually express. Across 7
datasets, WIMHF identifies a small number of human-interpretable features that
account for the majority of the preference prediction signal achieved by
black-box models. These features reveal a wide diversity in what humans prefer,
and the role of dataset-level context: for example, users on Reddit prefer
informality and jokes, while annotators in HH-RLHF and PRISM disprefer them.
WIMHF also surfaces potentially unsafe preferences, such as that LMArena users
tend to vote against refusals, often in favor of toxic content. The learned
features enable effective data curation: re-labeling the harmful examples in
Arena yields large safety gains (+37%) with no cost to general performance.
They also allow fine-grained personalization: on the Community Alignment
dataset, we learn annotator-specific weights over subjective features that
improve preference prediction. WIMHF provides a human-centered analysis method
for practitioners to better understand and use preference data.
comment: Code: https://github.com/rmovva/wimhf
☆ Don't Let It Fade: Preserving Edits in Diffusion Language Models via Token Timestep Allocation NeurIPS 2025
While diffusion language models (DLMs) enable fine-grained refinement, their
practical controllability remains fragile. We identify and formally
characterize a central failure mode called update forgetting, in which uniform
and context agnostic updates induce token level fluctuations across timesteps,
erasing earlier semantic edits and disrupting the cumulative refinement
process, thereby degrading fluency and coherence. As this failure originates in
uniform and context agnostic updates, effective control demands explicit token
ordering. We propose Token Timestep Allocation (TTA), which realizes soft and
semantic token ordering via per token timestep schedules: critical tokens are
frozen early, while uncertain tokens receive continued refinement. This
timestep based ordering can be instantiated as either a fixed policy or an
adaptive policy driven by task signals, thereby supporting a broad spectrum of
refinement strategies. Because it operates purely at inference time, it applies
uniformly across various DLMs and naturally extends to diverse supervision
sources. Empirically, TTA improves controllability and fluency: on sentiment
control, it yields more than 20 percent higher accuracy and nearly halves
perplexity using less than one fifth the steps; in detoxification, it lowers
maximum toxicity (12.2 versus 14.5) and perplexity (26.0 versus 32.0).
Together, these results demonstrate that softened ordering via timestep
allocation is the critical lever for mitigating update forgetting and achieving
stable and controllable diffusion text generation.
comment: Accepted in NeurIPS 2025
☆ Predicting All-Cause Hospital Readmissions from Medical Claims Data of Hospitalised Patients
Reducing preventable hospital readmissions is a national priority for payers,
providers, and policymakers seeking to improve health care and lower costs. The
rate of readmission is being used as a benchmark to determine the quality of
healthcare provided by the hospitals. In thisproject, we have used machine
learning techniques like Logistic Regression, Random Forest and Support Vector
Machines to analyze the health claims data and identify demographic and medical
factors that play a crucial role in predicting all-cause readmissions. As the
health claims data is high dimensional, we have used Principal Component
Analysis as a dimension reduction technique and used the results for building
regression models. We compared and evaluated these models based on the Area
Under Curve (AUC) metric. Random Forest model gave the highest performance
followed by Logistic Regression and Support Vector Machine models. These models
can be used to identify the crucial factors causing readmissions and help
identify patients to focus on to reduce the chances of readmission, ultimately
bringing down the cost and increasing the quality of healthcare provided to the
patients.
comment: NCMLAI 2018
☆ ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts NeurIPS 2025
Dataset bias, where data points are skewed to certain concepts, is ubiquitous
in machine learning datasets. Yet, systematically identifying these biases is
challenging without costly, fine-grained attribute annotations. We present
ConceptScope, a scalable and automated framework for analyzing visual datasets
by discovering and quantifying human-interpretable concepts using Sparse
Autoencoders trained on representations from vision foundation models.
ConceptScope categorizes concepts into target, context, and bias types based on
their semantic relevance and statistical correlation to class labels, enabling
class-level dataset characterization, bias identification, and robustness
evaluation through concept-based subgrouping. We validate that ConceptScope
captures a wide range of visual concepts, including objects, textures,
backgrounds, facial attributes, emotions, and actions, through comparisons with
annotated datasets. Furthermore, we show that concept activations produce
spatial attributions that align with semantically meaningful image regions.
ConceptScope reliably detects known biases (e.g., background bias in
Waterbirds) and uncovers previously unannotated ones (e.g, co-occurring objects
in ImageNet), offering a practical tool for dataset auditing and model
diagnostics.
comment: Published in the Thirty-Ninth Conference on Neural Information
Processing Systems (NeurIPS 2025)
☆ Accumulative SGD Influence Estimation for Data Attribution
Modern data-centric AI needs precise per-sample influence. Standard SGD-IE
approximates leave-one-out effects by summing per-epoch surrogates and ignores
cross-epoch compounding, which misranks critical examples. We propose
ACC-SGD-IE, a trajectory-aware estimator that propagates the leave-one-out
perturbation across training and updates an accumulative influence state at
each step. In smooth strongly convex settings it achieves geometric error
contraction and, in smooth non-convex regimes, it tightens error bounds; larger
mini-batches further reduce constants. Empirically, on Adult, 20 Newsgroups,
and MNIST under clean and corrupted data and both convex and non-convex
training, ACC-SGD-IE yields more accurate influence estimates, especially over
long epochs. For downstream data cleansing it more reliably flags noisy
samples, producing models trained on ACC-SGD-IE cleaned data that outperform
those cleaned with SGD-IE.
☆ Linking Heterogeneous Data with Coordinated Agent Flows for Social Media Analysis
Social media platforms generate massive volumes of heterogeneous data,
capturing user behaviors, textual content, temporal dynamics, and network
structures. Analyzing such data is crucial for understanding phenomena such as
opinion dynamics, community formation, and information diffusion. However,
discovering insights from this complex landscape is exploratory, conceptually
challenging, and requires expertise in social media mining and visualization.
Existing automated approaches, though increasingly leveraging large language
models (LLMs), remain largely confined to structured tabular data and cannot
adequately address the heterogeneity of social media analysis. We present SIA
(Social Insight Agents), an LLM agent system that links heterogeneous
multi-modal data -- including raw inputs (e.g., text, network, and behavioral
data), intermediate outputs, mined analytical results, and visualization
artifacts -- through coordinated agent flows. Guided by a bottom-up taxonomy
that connects insight types with suitable mining and visualization techniques,
SIA enables agents to plan and execute coherent analysis strategies. To ensure
multi-modal integration, it incorporates a data coordinator that unifies
tabular, textual, and network data into a consistent flow. Its interactive
interface provides a transparent workflow where users can trace, validate, and
refine the agent's reasoning, supporting both adaptability and trustworthiness.
Through expert-centered case studies and quantitative evaluation, we show that
SIA effectively discovers diverse and meaningful insights from social media
while supporting human-agent collaboration in complex analytical tasks.
☆ One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning
Reward models (RMs) play a critical role in aligning large language models
(LLMs) with human preferences. Yet in the domain of tool learning, the lack of
RMs specifically designed for function-calling tasks has limited progress
toward more capable agentic AI. We introduce ToolRM, a family of lightweight
generative RMs tailored for general tool-use scenarios. To build these models,
we propose a novel pipeline that constructs pairwise preference data using
rule-based scoring and multidimensional sampling. This yields
ToolPref-Pairwise-30K, a diverse, balanced, and challenging dataset of critique
tasks that supports reinforcement learning with verifiable feedback. To
evaluate tool-use RMs, we also introduce TRBench$_{BFCL}$, a benchmark built on
the agentic evaluation suite BFCL. Trained on our constructed data, models from
the Qwen3-4B/8B series achieve up to 14.28% higher accuracy, substantially
outperforming frontier models such as Claude 4 and OpenAI o3 in pairwise reward
judgments. Beyond training objectives, ToolRM generalizes to broader critique
tasks, including Best-of-N sampling and self-correction. Experiments on
ACEBench highlight its effectiveness and efficiency, enabling inference-time
scaling and reducing output token usage by over 66%. We release data and model
checkpoints to facilitate future research.
☆ Learning to Manage Investment Portfolios beyond Simple Utility Functions
While investment funds publicly disclose their objectives in broad terms,
their managers optimize for complex combinations of competing goals that go
beyond simple risk-return trade-offs. Traditional approaches attempt to model
this through multi-objective utility functions, but face fundamental challenges
in specification and parameterization. We propose a generative framework that
learns latent representations of fund manager strategies without requiring
explicit utility specification.
Our approach directly models the conditional probability of a fund's
portfolio weights, given stock characteristics, historical returns, previous
weights, and a latent variable representing the fund's strategy. Unlike methods
based on reinforcement learning or imitation learning, which require specified
rewards or labeled expert objectives, our GAN-based architecture learns
directly from the joint distribution of observed holdings and market data.
We validate our framework on a dataset of 1436 U.S. equity mutual funds. The
learned representations successfully capture known investment styles, such as
"growth" and "value," while also revealing implicit manager objectives. For
instance, we find that while many funds exhibit characteristics of
Markowitz-like optimization, they do so with heterogeneous realizations for
turnover, concentration, and latent factors.
To analyze and interpret the end-to-end model, we develop a series of tests
that explain the model, and we show that the benchmark's expert labeling are
contained in our model's encoding in a linear interpretable way.
Our framework provides a data-driven approach for characterizing investment
strategies for applications in market simulation, strategy attribution, and
regulatory oversight.
comment: 6th ACM International Conference on AI in Finance, November 15-18,
2025, Singapore
☆ Segmentation over Complexity: Evaluating Ensemble and Hybrid Approaches for Anomaly Detection in Industrial Time Series
In this study, we investigate the effectiveness of advanced feature
engineering and hybrid model architectures for anomaly detection in a
multivariate industrial time series, focusing on a steam turbine system. We
evaluate the impact of change point-derived statistical features,
clustering-based substructure representations, and hybrid learning strategies
on detection performance. Despite their theoretical appeal, these complex
approaches consistently underperformed compared to a simple Random Forest +
XGBoost ensemble trained on segmented data. The ensemble achieved an AUC-ROC of
0.976, F1-score of 0.41, and 100% early detection within the defined time
window. Our findings highlight that, in scenarios with highly imbalanced and
temporally uncertain data, model simplicity combined with optimized
segmentation can outperform more sophisticated architectures, offering greater
robustness, interpretability, and operational utility.
comment: This paper is currently under review for presentation at the IEEE
SAMI 2026 Conference
☆ Bridging the Gap Between Molecule and Textual Descriptions via Substructure-aware Alignment EMNLP 2025
Molecule and text representation learning has gained increasing interest due
to its potential for enhancing the understanding of chemical information.
However, existing models often struggle to capture subtle differences between
molecules and their descriptions, as they lack the ability to learn
fine-grained alignments between molecular substructures and chemical phrases.
To address this limitation, we introduce MolBridge, a novel molecule-text
learning framework based on substructure-aware alignments. Specifically, we
augment the original molecule-description pairs with additional alignment
signals derived from molecular substructures and chemical phrases. To
effectively learn from these enriched alignments, MolBridge employs
substructure-aware contrastive learning, coupled with a self-refinement
mechanism that filters out noisy alignment signals. Experimental results show
that MolBridge effectively captures fine-grained correspondences and
outperforms state-of-the-art baselines on a wide range of molecular benchmarks,
highlighting the significance of substructure-aware alignment in molecule-text
learning.
comment: EMNLP 2025 (main)
☆ MV-MLM: Bridging Multi-View Mammography and Language for Breast Cancer Diagnosis and Risk Prediction ICCV 2025
Large annotated datasets are essential for training robust Computer-Aided
Diagnosis (CAD) models for breast cancer detection or risk prediction. However,
acquiring such datasets with fine-detailed annotation is both costly and
time-consuming. Vision-Language Models (VLMs), such as CLIP, which are
pre-trained on large image-text pairs, offer a promising solution by enhancing
robustness and data efficiency in medical imaging tasks. This paper introduces
a novel Multi-View Mammography and Language Model for breast cancer
classification and risk prediction, trained on a dataset of paired mammogram
images and synthetic radiology reports. Our MV-MLM leverages multi-view
supervision to learn rich representations from extensive radiology data by
employing cross-modal self-supervision across image-text pairs. This includes
multiple views and the corresponding pseudo-radiology reports. We propose a
novel joint visual-textual learning strategy to enhance generalization and
accuracy performance over different data types and tasks to distinguish breast
tissues or cancer characteristics(calcification, mass) and utilize these
patterns to understand mammography images and predict cancer risk. We evaluated
our method on both private and publicly available datasets, demonstrating that
the proposed model achieves state-of-the-art performance in three
classification tasks: (1) malignancy classification, (2) subtype
classification, and (3) image-based cancer risk prediction. Furthermore, the
model exhibits strong data efficiency, outperforming existing fully supervised
or VLM baselines while trained on synthetic text reports and without the need
for actual radiology reports.
comment: Accepted to Computer Vision for Automated Medical Diagnosis (CVAMD)
Workshop at ICCV 2025
☆ The FM Agent
Annan Li, Chufan Wu, Zengle Ge, Yee Hin Chong, Zhinan Hou, Lizhe Cao, Cheng Ju, Jianmin Wu, Huaiming Li, Haobo Zhang, Shenghao Feng, Mo Zhao, Fengzhi Qiu, Rui Yang, Mengmeng Zhang, Wenyi Zhu, Yingying Sun, Quan Sun, Shunhao Yan, Danyu Liu, Dawei Yin, Dou Shen
Large language models (LLMs) are catalyzing the development of autonomous AI
research agents for scientific and engineering discovery. We present FM Agent,
a novel and general-purpose multi-agent framework that leverages a synergistic
combination of LLM-based reasoning and large-scale evolutionary search to
address complex real-world challenges. The core of FM Agent integrates several
key innovations: 1) a cold-start initialization phase incorporating expert
guidance, 2) a novel evolutionary sampling strategy for iterative optimization,
3) domain-specific evaluators that combine correctness, effectiveness, and
LLM-supervised feedback, and 4) a distributed, asynchronous execution
infrastructure built on Ray. Demonstrating broad applicability, our system has
been evaluated across diverse domains, including operations research, machine
learning, GPU kernel optimization, and classical mathematical problems. FM
Agent reaches state-of-the-art results autonomously, without human
interpretation or tuning -- 1976.3 on ALE-Bench (+5.2\%), 43.56\% on MLE-Bench
(+4.0pp), up to 20x speedups on KernelBench, and establishes new
state-of-the-art(SOTA) results on several classical mathematical problems.
Beyond academic benchmarks, FM Agent shows considerable promise for both
large-scale enterprise R\&D workflows and fundamental scientific research,
where it can accelerate innovation, automate complex discovery processes, and
deliver substantial engineering and scientific advances with broader societal
impact.
☆ Reasoning Curriculum: Bootstrapping Broad LLM Reasoning from Math
Reinforcement learning (RL) can elicit strong reasoning in large language
models (LLMs), yet most open efforts focus on math and code. We propose
Reasoning Curriculum, a simple two-stage curriculum that first elicits
reasoning skills in pretraining-aligned domains such as math, then adapts and
refines these skills across other domains via joint RL. Stage 1 performs a
brief cold start and then math-only RL with verifiable rewards to develop
reasoning skills. Stage 2 runs joint RL on mixed-domain data to transfer and
consolidate these skills. The curriculum is minimal and backbone-agnostic,
requiring no specialized reward models beyond standard verifiability checks.
Evaluated on Qwen3-4B and Llama-3.1-8B over a multi-domain suite, reasoning
curriculum yields consistent gains. Ablations and a cognitive-skill analysis
indicate that both stages are necessary and that math-first elicitation
increases cognitive behaviors important for solving complex problems. Reasoning
Curriculum provides a compact, easy-to-adopt recipe for general reasoning.
comment: 9 pages
☆ Beyond Benchmarks: The Economics of AI Inference
Boqin Zhuang, Jiacheng Qiao, Mingqian Liu, Mingxing Yu, Ping Hong, Rui Li, Xiaoxia Song, Xiangjun Xu, Xu Chen, Yaoyao Ma, Yujie Gao
The inference cost of Large Language Models (LLMs) has become a critical
factor in determining their commercial viability and widespread adoption. This
paper introduces a quantitative ``economics of inference'' framework, treating
the LLM inference process as a compute-driven intelligent production activity.
We analyze its marginal cost, economies of scale, and quality of output under
various performance configurations. Based on empirical data from WiNEval-3.0,
we construct the first ``LLM Inference Production Frontier,'' revealing three
principles: diminishing marginal cost, diminishing returns to scale, and an
optimal cost-effectiveness zone. This paper not only provides an economic basis
for model deployment decisions but also lays an empirical foundation for the
future market-based pricing and optimization of AI inference resources.
☆ Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation
Large language models (LLMs) have advanced code generation at the function
level, yet their ability to produce correct class-level implementations in
authentic software projects remains poorly understood. This work introduces a
novel benchmark derived from open-source repositories, comprising real-world
classes divided into seen and unseen partitions to evaluate generalization
under practical conditions. The evaluation examines multiple LLMs under varied
input specifications, retrieval-augmented configurations, and documentation
completeness levels.
Results reveal a stark performance disparity: LLMs achieve 84% to 89%
correctness on established synthetic benchmarks but only 25% to 34% on
real-world class tasks, with negligible differences between familiar and novel
codebases. Comprehensive docstrings yield modest gains of 1% to 3% in
functional accuracy, though statistical significance is rare.
Retrieval-augmented generation proves most effective with partial
documentation, improving correctness by 4% to 7% by supplying concrete
implementation patterns absent from specifications. Error profiling identifies
AttributeError, TypeError, and AssertionError as dominant failure modes (84% of
cases), with synthetic tests overemphasizing assertion issues and real-world
scenarios highlighting type and attribute mismatches. Retrieval augmentation
reduces logical flaws but can introduce dependency conflicts.
The benchmark and analysis expose critical limitations in current LLM
capabilities for class-level engineering, offering actionable insights for
enhancing context modelling, documentation strategies, and retrieval
integration in production code assistance tools.
comment: Pre-print prepared for journal submission
☆ WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios
Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou, Liting Sun, John Gorman, Kate Tolstaya, Sarah Tang, Brandyn White, Ben Sapp, Mingxing Tan, Jyh-Jing Hwang, Drago Anguelov
Vision-based end-to-end (E2E) driving has garnered significant interest in
the research community due to its scalability and synergy with multimodal large
language models (MLLMs). However, current E2E driving benchmarks primarily
feature nominal scenarios, failing to adequately test the true potential of
these systems. Furthermore, existing open-loop evaluation metrics often fall
short in capturing the multi-modal nature of driving or effectively evaluating
performance in long-tail scenarios. To address these gaps, we introduce the
Waymo Open Dataset for End-to-End Driving (WOD-E2E). WOD-E2E contains 4,021
driving segments (approximately 12 hours), specifically curated for challenging
long-tail scenarios that that are rare in daily life with an occurring
frequency of less than 0.03%. Concretely, each segment in WOD-E2E includes the
high-level routing information, ego states, and 360-degree camera views from 8
surrounding cameras. To evaluate the E2E driving performance on these long-tail
situations, we propose a novel open-loop evaluation metric: Rater Feedback
Score (RFS). Unlike conventional metrics that measure the distance between
predicted way points and the logs, RFS measures how closely the predicted
trajectory matches rater-annotated trajectory preference labels. We have
released rater preference labels for all WOD-E2E validation set segments, while
the held out test set labels have been used for the 2025 WOD-E2E Challenge.
Through our work, we aim to foster state of the art research into
generalizable, robust, and safe end-to-end autonomous driving agents capable of
handling complex real-world situations.
☆ EgoExo-Con: Exploring View-Invariant Video Temporal Understanding
Can Video-LLMs achieve consistent temporal understanding when videos capture
the same event from different viewpoints? To study this, we introduce
EgoExo-Con (Consistency), a benchmark of comprehensively synchronized
egocentric and exocentric video pairs with human-refined queries in natural
language. EgoExo-Con emphasizes two temporal understanding tasks: Temporal
Verification and Temporal Grounding. It evaluates not only correctness but
consistency across viewpoints. Our analysis reveals two critical limitations of
existing Video-LLMs: (1) models often fail to maintain consistency, with
results far worse than their single-view performances. (2) When naively
finetuned with synchronized videos of both viewpoints, the models show improved
consistency but often underperform those trained on a single view. For
improvements, we propose View-GRPO, a novel reinforcement learning framework
that effectively strengthens view-specific temporal reasoning while encouraging
consistent comprehension across viewpoints. Our method demonstrates its
superiority over naive SFT and GRPO, especially for improving cross-view
consistency. All resources will be made publicly available.
comment: project page:
\url{https://minjoong507.github.io/projects/EgoExo-Con/}
☆ Security Risk of Misalignment between Text and Image in Multi-modal Model
Despite the notable advancements and versatility of multi-modal diffusion
models, such as text-to-image models, their susceptibility to adversarial
inputs remains underexplored. Contrary to expectations, our investigations
reveal that the alignment between textual and Image modalities in existing
diffusion models is inadequate. This misalignment presents significant risks,
especially in the generation of inappropriate or Not-Safe-For-Work (NSFW)
content. To this end, we propose a novel attack called Prompt-Restricted
Multi-modal Attack (PReMA) to manipulate the generated content by modifying the
input image in conjunction with any specified prompt, without altering the
prompt itself. PReMA is the first attack that manipulates model outputs by
solely creating adversarial images, distinguishing itself from prior methods
that primarily generate adversarial prompts to produce NSFW content.
Consequently, PReMA poses a novel threat to the integrity of multi-modal
diffusion models, particularly in image-editing applications that operate with
fixed prompts. Comprehensive evaluations conducted on image inpainting and
style transfer tasks across various models confirm the potent efficacy of
PReMA.
☆ SAFE: A Novel Approach to AI Weather Evaluation through Stratified Assessments of Forecasts over Earth
The dominant paradigm in machine learning is to assess model performance
based on average loss across all samples in some test set. This amounts to
averaging performance geospatially across the Earth in weather and climate
settings, failing to account for the non-uniform distribution of human
development and geography. We introduce Stratified Assessments of Forecasts
over Earth (SAFE), a package for elucidating the stratified performance of a
set of predictions made over Earth. SAFE integrates various data domains to
stratify by different attributes associated with geospatial gridpoints:
territory (usually country), global subregion, income, and landcover (land or
water). This allows us to examine the performance of models for each individual
stratum of the different attributes (e.g., the accuracy in every individual
country). To demonstrate its importance, we utilize SAFE to benchmark a zoo of
state-of-the-art AI-based weather prediction models, finding that they all
exhibit disparities in forecasting skill across every attribute. We use this to
seed a benchmark of model forecast fairness through stratification at different
lead times for various climatic variables. By moving beyond globally-averaged
metrics, we for the first time ask: where do models perform best or worst, and
which models are most fair? To support further work in this direction, the SAFE
package is open source and available at https://github.com/N-Masi/safe
☆ GUI Knowledge Bench: Revealing the Knowledge Gap Behind VLM Failures in GUI Tasks
Chenrui Shi, Zedong Yu, Zhi Gao, Ruining Feng, Enqi Liu, Yuwei Wu, Yunde Jia, Liuyu Xiang, Zhaofeng He, Qing Li
Large vision language models (VLMs) have advanced graphical user interface
(GUI) task automation but still lag behind humans. We hypothesize this gap
stems from missing core GUI knowledge, which existing training schemes (such as
supervised fine tuning and reinforcement learning) alone cannot fully address.
By analyzing common failure patterns in GUI task execution, we distill GUI
knowledge into three dimensions: (1) interface perception, knowledge about
recognizing widgets and system states; (2) interaction prediction, knowledge
about reasoning action state transitions; and (3) instruction understanding,
knowledge about planning, verifying, and assessing task completion progress. We
further introduce GUI Knowledge Bench, a benchmark with multiple choice and
yes/no questions across six platforms (Web, Android, MacOS, Windows, Linux,
IOS) and 292 applications. Our evaluation shows that current VLMs identify
widget functions but struggle with perceiving system states, predicting
actions, and verifying task completion. Experiments on real world GUI tasks
further validate the close link between GUI knowledge and task success. By
providing a structured framework for assessing GUI knowledge, our work supports
the selection of VLMs with greater potential prior to downstream training and
provides insights for building more capable GUI agents.
☆ Lean4Physics: Comprehensive Reasoning Framework for College-level Physics in Lean4
Yuxin Li, Minghao Liu, Ruida Wang, Wenzhao Ji, Zhitao He, Rui Pan, Junming Huang, Tong Zhang, Yi R. Fung
We present **Lean4PHYS**, a comprehensive reasoning framework for
college-level physics problems in Lean4. **Lean4PHYS** includes
*LeanPhysBench*, a college-level benchmark for formal physics reasoning in
Lean4, which contains 200 hand-crafted and peer-reviewed statements derived
from university textbooks and physics competition problems. To establish a
solid foundation for formal reasoning in physics, we also introduce *PhysLib*,
a community-driven repository containing fundamental unit systems and theorems
essential for formal physics reasoning. Based on the benchmark and Lean4
repository we composed in **Lean4PHYS**, we report baseline results using major
expert Math Lean4 provers and state-of-the-art closed-source models, with the
best performance of DeepSeek-Prover-V2-7B achieving only 16% and
Claude-Sonnet-4 achieving 35%. We also conduct a detailed analysis showing that
our *PhysLib* can achieve an average improvement of 11.75% in model
performance. This demonstrates the challenging nature of our *LeanPhysBench*
and the effectiveness of *PhysLib*. To the best of our knowledge, this is the
first study to provide a physics benchmark in Lean4.
☆ Network-Constrained Policy Optimization for Adaptive Multi-agent Vehicle Routing
Traffic congestion in urban road networks leads to longer trip times and
higher emissions, especially during peak periods. While the Shortest Path First
(SPF) algorithm is optimal for a single vehicle in a static network, it
performs poorly in dynamic, multi-vehicle settings, often worsening congestion
by routing all vehicles along identical paths. We address dynamic vehicle
routing through a multi-agent reinforcement learning (MARL) framework for
coordinated, network-aware fleet navigation. We first propose Adaptive
Navigation (AN), a decentralized MARL model where each intersection agent
provides routing guidance based on (i) local traffic and (ii) neighborhood
state modeled using Graph Attention Networks (GAT). To improve scalability in
large networks, we further propose Hierarchical Hub-based Adaptive Navigation
(HHAN), an extension of AN that assigns agents only to key intersections
(hubs). Vehicles are routed hub-to-hub under agent control, while SPF handles
micro-routing within each hub region. For hub coordination, HHAN adopts
centralized training with decentralized execution (CTDE) under the Attentive
Q-Mixing (A-QMIX) framework, which aggregates asynchronous vehicle decisions
via attention. Hub agents use flow-aware state features that combine local
congestion and predictive dynamics for proactive routing. Experiments on
synthetic grids and real urban maps (Toronto, Manhattan) show that AN reduces
average travel time versus SPF and learning baselines, maintaining 100% routing
success. HHAN scales to networks with hundreds of intersections, achieving up
to 15.9% improvement under heavy traffic. These findings highlight the
potential of network-constrained MARL for scalable, coordinated, and
congestion-aware routing in intelligent transportation systems.
comment: 29 pages, 12 figures. Fazel Arasteh and Arian Haghparast contributed
equally to this research. Submitted to ACM Transactions on Spatial Algorithms
and Systems (TSAS). The code for this work is publicly available at
https://github.com/Arianhgh/HHAN
☆ Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism
Yuhua Jiang, Shuang Cheng, Yihao Liu, Ermo Hua, Che Jiang, Weigao Sun, Yu Cheng, Feifei Gao, Biqing Qi, Bowen Zhou
Specialized Generalist Models (SGMs) aim to preserve broad capabilities while
achieving expert-level performance in target domains. However, traditional LLM
structures including Transformer, Linear Attention, and hybrid models do not
employ specialized memory mechanism guided by task information. In this paper,
we present Nirvana, an SGM with specialized memory mechanism, linear time
complexity, and test-time task information extraction. Besides, we propose the
Task-Aware Memory Trigger ($\textit{Trigger}$) that flexibly adjusts memory
mechanism based on the current task's requirements. In Trigger, each incoming
sample is treated as a self-supervised fine-tuning task, enabling Nirvana to
adapt its task-related parameters on the fly to domain shifts. We also design
the Specialized Memory Updater ($\textit{Updater}$) that dynamically memorizes
the context guided by Trigger. We conduct experiments on both general language
tasks and specialized medical tasks. On a variety of natural language modeling
benchmarks, Nirvana achieves competitive or superior results compared to the
existing LLM structures. To prove the effectiveness of Trigger on specialized
tasks, we test Nirvana's performance on a challenging medical task, i.e.,
Magnetic Resonance Imaging (MRI). We post-train frozen Nirvana backbone with
lightweight codecs on paired electromagnetic signals and MRI images. Despite
the frozen Nirvana backbone, Trigger guides the model to adapt to the MRI
domain with the change of task-related parameters. Nirvana achieves
higher-quality MRI reconstruction compared to conventional MRI models as well
as the models with traditional LLMs' backbone, and can also generate accurate
preliminary clinical reports accordingly.
☆ Learning Geometry: A Framework for Building Adaptive Manifold Models through Metric Optimization
This paper proposes a novel paradigm for machine learning that moves beyond
traditional parameter optimization. Unlike conventional approaches that search
for optimal parameters within a fixed geometric space, our core idea is to
treat the model itself as a malleable geometric entity. Specifically, we
optimize the metric tensor field on a manifold with a predefined topology,
thereby dynamically shaping the geometric structure of the model space. To
achieve this, we construct a variational framework whose loss function
carefully balances data fidelity against the intrinsic geometric complexity of
the manifold. The former ensures the model effectively explains observed data,
while the latter acts as a regularizer, penalizing overly curved or irregular
geometries to encourage simpler models and prevent overfitting. To address the
computational challenges of this infinite-dimensional optimization problem, we
introduce a practical method based on discrete differential geometry: the
continuous manifold is discretized into a triangular mesh, and the metric
tensor is parameterized by edge lengths, enabling efficient optimization using
automatic differentiation tools. Theoretical analysis reveals a profound
analogy between our framework and the Einstein-Hilbert action in general
relativity, providing an elegant physical interpretation for the concept of
"data-driven geometry". We further argue that even with fixed topology, metric
optimization offers significantly greater expressive power than models with
fixed geometry. This work lays a solid foundation for constructing fully
dynamic "meta-learners" capable of autonomously evolving their geometry and
topology, and it points to broad application prospects in areas such as
scientific model discovery and robust representation learning.
comment: 9 pages
☆ Data-driven Projection Generation for Efficiently Solving Heterogeneous Quadratic Programming Problems
We propose a data-driven framework for efficiently solving quadratic
programming (QP) problems by reducing the number of variables in
high-dimensional QPs using instance-specific projection. A graph neural
network-based model is designed to generate projections tailored to each QP
instance, enabling us to produce high-quality solutions even for previously
unseen problems. The model is trained on heterogeneous QPs to minimize the
expected objective value evaluated on the projected solutions. This is
formulated as a bilevel optimization problem; the inner optimization solves the
QP under a given projection using a QP solver, while the outer optimization
updates the model parameters. We develop an efficient algorithm to solve this
bilevel optimization problem, which computes parameter gradients without
backpropagating through the solver. We provide a theoretical analysis of the
generalization ability of solving QPs with projection matrices generated by
neural networks. Experimental results demonstrate that our method produces
high-quality feasible solutions with reduced computation time, outperforming
existing methods.
☆ Can AI be Accountable?
The AI we use is powerful, and its power is increasing rapidly. If this
powerful AI is to serve the needs of consumers, voters, and decision makers,
then it is imperative that the AI is accountable. In general, an agent is
accountable to a forum if the forum can request information from the agent
about its actions, if the forum and the agent can discuss this information, and
if the forum can sanction the agent. Unfortunately, in too many cases today's
AI is not accountable -- we cannot question it, enter into a discussion with
it, let alone sanction it. In this chapter we relate the general definition of
accountability to AI, we illustrate what it means for AI to be accountable and
unaccountable, and we explore approaches that can improve our chances of living
in a world where all AI is accountable to those who are affected by it.
comment: To be published as a chapter in Daniele Quercia and Marios
Constantinides (Eds.). Operationalizing Responsible AI. Cambridge University
Press. Forthcoming
☆ Dynamic VLM-Guided Negative Prompting for Diffusion Models NeurIPS 2025
We propose a novel approach for dynamic negative prompting in diffusion
models that leverages Vision-Language Models (VLMs) to adaptively generate
negative prompts during the denoising process. Unlike traditional Negative
Prompting methods that use fixed negative prompts, our method generates
intermediate image predictions at specific denoising steps and queries a VLM to
produce contextually appropriate negative prompts. We evaluate our approach on
various benchmark datasets and demonstrate the trade-offs between negative
guidance strength and text-image alignment.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS
2025) Workshop: The First Workshop on Generative and Protective AI for
Content Creation
☆ Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods
Knowledge distillation (KD) is an effective method for model compression and
transferring knowledge between models. However, its effect on model's
robustness against spurious correlations that degrade performance on
out-of-distribution data remains underexplored. This study investigates the
effect of knowledge distillation on the transferability of ``debiasing''
capabilities from teacher models to student models on natural language
inference (NLI) and image classification tasks. Through extensive experiments,
we illustrate several key findings: (i) overall the debiasing capability of a
model is undermined post-KD; (ii) training a debiased model does not benefit
from injecting teacher knowledge; (iii) although the overall robustness of a
model may remain stable post-distillation, significant variations can occur
across different types of biases; and (iv) we pin-point the internal attention
pattern and circuit that causes the distinct behavior post-KD. Given the above
findings, we propose three effective solutions to improve the distillability of
debiasing methods: developing high quality data for augmentation, implementing
iterative knowledge distillation, and initializing student models with weights
obtained from teacher models. To the best of our knowledge, this is the first
study on the effect of KD on debiasing and its interenal mechanism at scale.
Our findings provide understandings on how KD works and how to design better
debiasing methods.
☆ SIRAJ: Diverse and Efficient Red-Teaming for LLM Agents via Distilled Structured Reasoning
The ability of LLM agents to plan and invoke tools exposes them to new safety
risks, making a comprehensive red-teaming system crucial for discovering
vulnerabilities and ensuring their safe deployment. We present SIRAJ: a generic
red-teaming framework for arbitrary black-box LLM agents. We employ a dynamic
two-step process that starts with an agent definition and generates diverse
seed test cases that cover various risk outcomes, tool-use trajectories, and
risk sources. Then, it iteratively constructs and refines model-based
adversarial attacks based on the execution trajectories of former attempts. To
optimize the red-teaming cost, we present a model distillation approach that
leverages structured forms of a teacher model's reasoning to train smaller
models that are equally effective. Across diverse evaluation agent settings,
our seed test case generation approach yields 2 -- 2.5x boost to the coverage
of risk outcomes and tool-calling trajectories. Our distilled 8B red-teamer
model improves attack success rate by 100%, surpassing the 671B Deepseek-R1
model. Our ablations and analyses validate the effectiveness of the iterative
framework, structured reasoning, and the generalization of our red-teamer
models.
☆ Artificial Intelligence-Enabled Analysis of Radiology Reports: Epidemiology and Consequences of Incidental Thyroid Findings
Felipe Larios, Mariana Borras-Osorio, Yuqi Wu, Ana Gabriela Claros, David Toro-Tobon, Esteban Cabezas, Ricardo Loor-Torres, Maria Mateo Chavez, Kerly Guevara Maldonado, Luis Vilatuna Andrango, Maria Lizarazo Jimenez, Ivan Mateo Alzamora, Misk Al Zahidy, Marcelo Montero, Ana Cristina Proano, Cristian Soto Jacome, Jungwei W. Fan, Oscar J. Ponce-Ponte, Megan E. Branda, Naykky Singh Ospina, Juan P. Brito
Importance Incidental thyroid findings (ITFs) are increasingly detected on
imaging performed for non-thyroid indications. Their prevalence, features, and
clinical consequences remain undefined. Objective To develop, validate, and
deploy a natural language processing (NLP) pipeline to identify ITFs in
radiology reports and assess their prevalence, features, and clinical outcomes.
Design, Setting, and Participants Retrospective cohort of adults without prior
thyroid disease undergoing thyroid-capturing imaging at Mayo Clinic sites from
July 1, 2017, to September 30, 2023. A transformer-based NLP pipeline
identified ITFs and extracted nodule characteristics from image reports from
multiple modalities and body regions. Main Outcomes and Measures Prevalence of
ITFs, downstream thyroid ultrasound, biopsy, thyroidectomy, and thyroid cancer
diagnosis. Logistic regression identified demographic and imaging-related
factors. Results Among 115,683 patients (mean age, 56.8 [SD 17.2] years; 52.9%
women), 9,077 (7.8%) had an ITF, of which 92.9% were nodules. ITFs were more
likely in women, older adults, those with higher BMI, and when imaging was
ordered by oncology or internal medicine. Compared with chest CT, ITFs were
more likely via neck CT, PET, and nuclear medicine scans. Nodule
characteristics were poorly documented, with size reported in 44% and other
features in fewer than 15% (e.g. calcifications). Compared with patients
without ITFs, those with ITFs had higher odds of thyroid nodule diagnosis,
biopsy, thyroidectomy and thyroid cancer diagnosis. Most cancers were
papillary, and larger when detected after ITFs vs no ITF. Conclusions ITFs were
common and strongly associated with cascades leading to the detection of small,
low-risk cancers. These findings underscore the role of ITFs in thyroid cancer
overdiagnosis and the need for standardized reporting and more selective
follow-up.
♻ ☆ UniSite: The First Cross-Structure Dataset and Learning Framework for End-to-End Ligand Binding Site Detection
The detection of ligand binding sites for proteins is a fundamental step in
Structure-Based Drug Design. Despite notable advances in recent years, existing
methods, datasets, and evaluation metrics are confronted with several key
challenges: (1) current datasets and methods are centered on individual
protein-ligand complexes and neglect that diverse binding sites may exist
across multiple complexes of the same protein, introducing significant
statistical bias; (2) ligand binding site detection is typically modeled as a
discontinuous workflow, employing binary segmentation and subsequent clustering
algorithms; (3) traditional evaluation metrics do not adequately reflect the
actual performance of different binding site prediction methods. To address
these issues, we first introduce UniSite-DS, the first UniProt (Unique
Protein)-centric ligand binding site dataset, which contains 4.81 times more
multi-site data and 2.08 times more overall data compared to the previously
most widely used datasets. We then propose UniSite, the first end-to-end ligand
binding site detection framework supervised by set prediction loss with
bijective matching. In addition, we introduce Average Precision based on
Intersection over Union (IoU) as a more accurate evaluation metric for ligand
binding site prediction. Extensive experiments on UniSite-DS and several
representative benchmark datasets demonstrate that IoU-based Average Precision
provides a more accurate reflection of prediction quality, and that UniSite
outperforms current state-of-the-art methods in ligand binding site detection.
The dataset and codes will be made publicly available at
https://github.com/quanlin-wu/unisite.
♻ ☆ Completion $\neq$ Collaboration: Scaling Collaborative Effort with Agents
Shannon Zejiang Shen, Valerie Chen, Ken Gu, Alexis Ross, Zixian Ma, Jillian Ross, Alex Gu, Chenglei Si, Wayne Chi, Andi Peng, Jocelyn J Shen, Ameet Talwalkar, Tongshuang Wu, David Sontag
Current evaluations of agents remain centered around one-shot task
completion, failing to account for the inherently iterative and collaborative
nature of many real-world problems, where human goals are often underspecified
and evolve. We argue for a shift from building and assessing task completion
agents to developing collaborative agents, assessed not only by the quality of
their final outputs but by how well they engage with and enhance human effort
throughout the problem-solving process. To support this shift, we introduce
collaborative effort scaling, a framework that captures how an agent's utility
grows with increasing user involvement. Through case studies and simulated
evaluations, we show that state-of-the-art agents often underperform in
multi-turn, real-world scenarios, revealing a missing ingredient in agent
design: the ability to sustain engagement and scaffold user understanding.
Collaborative effort scaling offers a lens for diagnosing agent behavior and
guiding development toward more effective interactions.
comment: 22 pages, 5 figures, 3 tables
♻ ☆ Controlling Thinking Speed in Reasoning Models NeurIPS 2025
Zhengkai Lin, Zhihang Fu, Ze Chen, Chao Chen, Liang Xie, Wenxiao Wang, Deng Cai, Zheng Wang, Jieping Ye
Human cognition is theorized to operate in two modes: fast, intuitive System
1 thinking and slow, deliberate System 2 thinking. While current Large
Reasoning Models (LRMs) excel at System 2 thinking, their inability to perform
fast thinking leads to high computational overhead and latency. In this work,
we enable LRMs to approximate human intelligence through dynamic thinking speed
adjustment, optimizing accuracy-efficiency trade-offs. Our approach addresses
two key questions: (1) how to control thinking speed in LRMs, and (2) when to
adjust it for optimal performance. For the first question, we identify the
steering vector that governs slow-fast thinking transitions in LRMs'
representation space. Using this vector, we achieve the first representation
editing-based test-time scaling effect, outperforming existing prompt-based
scaling methods. For the second question, we apply real-time difficulty
estimation to signal reasoning segments of varying complexity. Combining these
techniques, we propose the first reasoning strategy that enables fast
processing of easy steps and deeper analysis for complex reasoning. Without any
training or additional cost, our plug-in module delivers an average +1.3%
accuracy with -8.6% token usage across leading LRMs and advanced reasoning
benchmarks. All of our algorithms are implemented based on vLLM and are
expected to support broader applications and inspire future research.
comment: NeurIPS 2025 Spotlight
♻ ☆ RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards
Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Ellie Evans, Daniel Egert, Hoo-Chang Shin, Felipe Soares, Yi Dong, Oleksii Kuchaiev
Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning
with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM
post-training, each offering distinct advantages. However, RLHF struggles with
interpretability and reward hacking because it relies on human judgments that
usually lack explicit criteria, whereas RLVR is limited in scope by its focus
on correctness-based verifiers. We propose Reinforcement Learning with Binary
Flexible Feedback (RLBFF), which combines the versatility of human-driven
preferences with the precision of rule-based verification, enabling reward
models to capture nuanced aspects of response quality beyond mere correctness.
RLBFF extracts principles that can be answered in a binary fashion (e.g.
accuracy of information: yes, or code readability: no) from natural language
feedback. Such principles can then be used to ground Reward Model training as
an entailment task (response satisfies or does not satisfy an arbitrary
principle). We show that Reward Models trained in this manner can outperform
Bradley-Terry models when matched for data and achieve top performance on
RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24,
2025). Additionally, users can specify principles of interest at inference time
to customize the focus of our reward models, in contrast to Bradley-Terry
models. Finally, we present a fully open source recipe (including data) to
align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the
performance of o3-mini and DeepSeek R1 on general alignment benchmarks of
MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost). Models:
https://huggingface.co/collections/nvidia/reward-models-10-2025
comment: Added link to access models:
https://huggingface.co/collections/nvidia/reward-models-10-2025
♻ ☆ Curriculum Abductive Learning NeurIPS 2025
Abductive Learning (ABL) integrates machine learning with logical reasoning
in a loop: a learning model predicts symbolic concept labels from raw inputs,
which are revised through abduction using domain knowledge and then fed back
for retraining. However, due to the nondeterminism of abduction, the training
process often suffers from instability, especially when the knowledge base is
large and complex, resulting in a prohibitively large abduction space. While
prior works focus on improving candidate selection within this space, they
typically treat the knowledge base as a static black box. In this work, we
propose Curriculum Abductive Learning (C-ABL), a method that explicitly
leverages the internal structure of the knowledge base to address the ABL
training challenges. C-ABL partitions the knowledge base into a sequence of
sub-bases, progressively introduced during training. This reduces the abduction
space throughout training and enables the model to incorporate logic in a
stepwise, smooth way. Experiments across multiple tasks show that C-ABL
outperforms previous ABL implementations, significantly improves training
stability, convergence speed, and final accuracy, especially under complex
knowledge setting.
comment: Accepted by NeurIPS 2025, 22 pages, 6 figures
♻ ☆ Guided Model Merging for Hybrid Data Learning: Leveraging Centralized Data to Refine Decentralized Models WACV 2026
Junyi Zhu, Ruicong Yao, Taha Ceritli, Savas Ozkan, Matthew B. Blaschko, Eunchung Noh, Jeongwon Min, Cho Jung Min, Mete Ozay
Current network training paradigms primarily focus on either centralized or
decentralized data regimes. However, in practice, data availability often
exhibits a hybrid nature, where both regimes coexist. This hybrid setting
presents new opportunities for model training, as the two regimes offer
complementary trade-offs: decentralized data is abundant but subject to
heterogeneity and communication constraints, while centralized data, though
limited in volume and potentially unrepresentative, enables better curation and
high-throughput access. Despite its potential, effectively combining these
paradigms remains challenging, and few frameworks are tailored to hybrid data
regimes. To address this, we propose a novel framework that constructs a model
atlas from decentralized models and leverages centralized data to refine a
global model within this structured space. The refined model is then used to
reinitialize the decentralized models. Our method synergizes federated learning
(to exploit decentralized data) and model merging (to utilize centralized
data), enabling effective training under hybrid data availability.
Theoretically, we show that our approach achieves faster convergence than
methods relying solely on decentralized data, due to variance reduction in the
merging process. Extensive experiments demonstrate that our framework
consistently outperforms purely centralized, purely decentralized, and existing
hybrid-adaptable methods. Notably, our method remains robust even when the
centralized and decentralized data domains differ or when decentralized data
contains noise, significantly broadening its applicability.
comment: Accepted at WACV 2026
♻ ☆ Refine-n-Judge: Curating High-Quality Preference Chains for LLM-Fine-Tuning
Derin Cayir, Renjie Tao, Rashi Rungta, Kai Sun, Sean Chen, Haidar Khan, Minseok Kim, Julia Reinspach, Yue Liu
Large Language Models (LLMs) have demonstrated remarkable progress through
preference-based fine-tuning, which critically depends on the quality of the
underlying training data. While human feedback is essential for improving data
quality, it is costly and does not scale well. In this paper, we introduce
Refine-n-Judge, an automated iterative approach that leverages a single LLM as
both a refiner and a judge to enhance dataset quality. Unlike existing
iterative refinement methods, Refine-n-Judge employs an LLM to both generate
refinements and explicitly evaluate each improvement, ensuring that every
iteration meaningfully enhances the dataset without requiring additional human
annotation or a separate reward model. At each step, the LLM refines a response
and judges whether the refinement is an improvement over the previous answer.
This process continues until the LLM prefers the initial answer over the
refinement, indicating no further improvements. This produces sequences of
increasing quality, preference-labeled responses ideal for fine-tuning.
We demonstrate the effectiveness of Refine-n-Judge across a range of public
datasets spanning five corpora, targeting tasks such as coding, math, and
conversation. Models (Llama 3.1-8B and Llama 3.3-70B) fine-tuned on
Refine-n-Judge-enhanced datasets were preferred by LLM judges in over 74% of
comparisons against models tuned on the original dataset by GPT-4.
Additionally, we report performance gains: +5% on AlpacaEval and AlpacaEval
2.0, and +19% on MT-Bench. Our results indicate that Refine-n-Judge produces
high-quality datasets and scalable model improvements.
♻ ☆ CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting ISWC 2025
Language interpretation is a compositional process, in which the meaning of
more complex linguistic structures is inferred from the meaning of their parts.
Large language models possess remarkable language interpretation capabilities
and have been successfully applied to interpret questions by mapping them to
SPARQL queries. An open question is how systematic this interpretation process
is. Toward this question, in this paper, we propose a benchmark for
investigating to what extent the abilities of LLMs to interpret questions are
actually compositional. For this, we generate three datasets of varying
difficulty based on graph patterns in DBpedia, relying on Lemon lexica for
verbalization. Our datasets are created in a very controlled fashion in order
to test the ability of LLMs to interpret structurally complex questions, given
that they have seen the atomic building blocks. This allows us to evaluate to
what degree LLMs are able to interpret complex questions for which they
"understand" the atomic parts. We conduct experiments with models of different
sizes using both various prompt and few-shot optimization techniques as well as
fine-tuning. Our results show that performance in terms of macro $F_1$ degrades
from $0.45$ over $0.26$ down to $0.09$ with increasing deviation from the
samples optimized on. Even when all necessary information was provided to the
model in the input, the $F_1$ scores do not exceed $0.57$ for the dataset of
lowest complexity. We thus conclude that LLMs struggle to systematically and
compositionally interpret questions and map them into SPARQL queries.
comment: Research Track, 24th International Semantic Web Conference (ISWC
2025), November 2-6, 2025, Nara, Japan
♻ ☆ Detecting Early and Implicit Suicidal Ideation via Longitudinal and Information Environment Signals on Social Media
Soorya Ram Shimgekar, Ruining Zhao, Agam Goyal, Violeta J. Rodriguez, Paul A. Bloom, Hari Sundaram, Koustuv Saha
On social media, many individuals experiencing suicidal ideation (SI) do not
disclose their distress explicitly. Instead, signs may surface indirectly
through everyday posts or peer interactions. Detecting such implicit signals
early is critical but remains challenging. We frame early and implicit SI as a
forward-looking prediction task and develop a computational framework that
models a user's information environment, consisting of both their longitudinal
posting histories as well as the discourse of their socially proximal peers. We
adopted a composite network centrality measure to identify top neighbors of a
user, and temporally aligned the user's and neighbors' interactions --
integrating the multi-layered signals in a fine-tuned DeBERTa-v3 model. In a
Reddit study of 1,000 (500 Case and 500 Control) users, our approach improves
early and implicit SI detection by 15% over individual-only baselines. These
findings highlight that peer interactions offer valuable predictive signals and
carry broader implications for designing early detection systems that capture
indirect as well as masked expressions of risk in online environments.
♻ ☆ Audio Signal Processing Using Time Domain Mel-Frequency Wavelet Coefficient
Extracting features from the speech is the most critical process in speech
signal processing. Mel Frequency Cepstral Coefficients (MFCC) are the most
widely used features in the majority of the speaker and speech recognition
applications, as the filtering in this feature is similar to the filtering
taking place in the human ear. But the main drawback of this feature is that it
provides only the frequency information of the signal but does not provide the
information about at what time which frequency is present. The wavelet
transform, with its flexible time-frequency window, provides time and frequency
information of the signal and is an appropriate tool for the analysis of
non-stationary signals like speech. On the other hand, because of its uniform
frequency scaling, a typical wavelet transform may be less effective in
analysing speech signals, have poorer frequency resolution in low frequencies,
and be less in line with human auditory perception. Hence, it is necessary to
develop a feature that incorporates the merits of both MFCC and wavelet
transform. A great deal of studies are trying to combine both these features.
The present Wavelet Transform based Mel-scaled feature extraction methods
require more computation when a wavelet transform is applied on top of
Mel-scale filtering, since it adds extra processing steps. Here we are
proposing a method to extract Mel scale features in time domain combining the
concept of wavelet transform, thus reducing the computational burden of
time-frequency conversion and the complexity of wavelet extraction. Combining
our proposed Time domain Mel frequency Wavelet Coefficient(TMFWC) technique
with the reservoir computing methodology has significantly improved the
efficiency of audio signal processing.
♻ ☆ MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos
Dense Video Object Captioning (DVOC) is the task of jointly detecting,
tracking, and captioning object trajectories in a video, requiring the ability
to understand spatio-temporal details and describe them in natural language.
Due to the complexity of the task and the high cost associated with manual
annotation, previous approaches resort to disjoint training strategies,
potentially leading to suboptimal performance. To circumvent this issue, we
propose to generate captions about spatio-temporally localized entities
leveraging a state-of-the-art VLM. By extending the LVIS and LV-VIS datasets
with our synthetic captions (LVISCap and LV-VISCap), we train MaskCaptioner, an
end-to-end model capable of jointly detecting, segmenting, tracking and
captioning object trajectories. Moreover, with pretraining on LVISCap and
LV-VISCap, MaskCaptioner achieves state-of-the-art DVOC results on three
existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are
available at https://www.gabriel.fiastre.fr/maskcaptioner/.
comment: 20 pages, 8 figures
♻ ☆ LatentBreak: Jailbreaking Large Language Models through Latent Space Feedback
Jailbreaks are adversarial attacks designed to bypass the built-in safety
mechanisms of large language models. Automated jailbreaks typically optimize an
adversarial suffix or adapt long prompt templates by forcing the model to
generate the initial part of a restricted or harmful response. In this work, we
show that existing jailbreak attacks that leverage such mechanisms to unlock
the model response can be detected by a straightforward perplexity-based
filtering on the input prompt. To overcome this issue, we propose LatentBreak,
a white-box jailbreak attack that generates natural adversarial prompts with
low perplexity capable of evading such defenses. LatentBreak substitutes words
in the input prompt with semantically-equivalent ones, preserving the initial
intent of the prompt, instead of adding high-perplexity adversarial suffixes or
long templates. These words are chosen by minimizing the distance in the latent
space between the representation of the adversarial prompt and that of harmless
requests. Our extensive evaluation shows that LatentBreak leads to shorter and
low-perplexity prompts, thus outperforming competing jailbreak algorithms
against perplexity-based filters on multiple safety-aligned models.
♻ ☆ SignalLLM: A General-Purpose LLM Agent Framework for Automated Signal Processing
Modern signal processing (SP) pipelines, whether model-based or data-driven,
often constrained by complex and fragmented workflow, rely heavily on expert
knowledge and manual engineering, and struggle with adaptability and
generalization under limited data. In contrast, Large Language Models (LLMs)
offer strong reasoning capabilities, broad general-purpose knowledge,
in-context learning, and cross-modal transfer abilities, positioning them as
powerful tools for automating and generalizing SP workflows. Motivated by these
potentials, we introduce SignalLLM, the first general-purpose LLM-based agent
framework for general SP tasks. Unlike prior LLM-based SP approaches that are
limited to narrow applications or tricky prompting, SignalLLM introduces a
principled, modular architecture. It decomposes high-level SP goals into
structured subtasks via in-context learning and domain-specific retrieval,
followed by hierarchical planning through adaptive retrieval-augmented
generation (RAG) and refinement; these subtasks are then executed through
prompt-based reasoning, cross-modal reasoning, code synthesis, model
invocation, or data-driven LLM-assisted modeling. Its generalizable design
enables the flexible selection of problem solving strategies across different
signal modalities, task types, and data conditions. We demonstrate the
versatility and effectiveness of SignalLLM through five representative tasks in
communication and sensing, such as radar target detection, human activity
recognition, and text compression. Experimental results show superior
performance over traditional and existing LLM-based methods, particularly in
few-shot and zero-shot settings.
comment: 11 pages
♻ ☆ AI's Social Forcefield: Reshaping Distributed Cognition in Human-AI Teams
AI is not only a neutral tool in team settings; it actively reshapes the
social and cognitive fabric of collaboration. We advance a unified framework of
alignment in distributed cognition in human-AI teams -- a process through which
linguistic, cognitive, and social coordination emerge as human and AI agents
co-construct a shared representational space. Across two studies, we show that
exposure to AI-generated language shapes not only how people speak, but also
how they think, what they attend to, and how they relate to each other.
Together, these findings reveal how AI participation reorganizes the
distributed cognitive architecture of teams: AI systems function as implicit
social forcefields. Our findings highlight the double-edged impact of AI: the
same mechanisms that enable efficient collaboration can also erode epistemic
diversity and undermine natural alignment processes. We argue for rethinking AI
in teams as a socially influential actor and call for new design paradigms that
foreground transparency, controllability, and group-level dynamics to foster
responsible, productive human-AI collaboration.
♻ ☆ Epistemic Diversity and Knowledge Collapse in Large Language Models
Dustin Wright, Sarah Masud, Jared Moore, Srishti Yadav, Maria Antoniak, Chan Young Park, Isabelle Augenstein
Large language models (LLMs) tend to generate lexically, semantically, and
stylistically homogenous texts. This poses a risk of knowledge collapse, where
homogenous LLMs mediate a shrinking in the range of accessible information over
time. Existing works on homogenization are limited by a focus on closed-ended
multiple-choice setups or fuzzy semantic features, and do not look at trends
across time and cultural contexts. To overcome this, we present a new
methodology to measure epistemic diversity, i.e., variation in real-world
claims in LLM outputs, which we use to perform a broad empirical study of LLM
knowledge collapse. We test 27 LLMs, 155 topics covering 12 countries, and 200
prompt variations sourced from real user chats. For the topics in our study, we
show that while newer models tend to generate more diverse claims, nearly all
models are less epistemically diverse than a basic web search. We find that
model size has a negative impact on epistemic diversity, while
retrieval-augmented generation (RAG) has a positive impact, though the
improvement from RAG varies by the cultural context. Finally, compared to a
traditional knowledge source (Wikipedia), we find that country-specific claims
reflect the English language more than the local one, highlighting a gap in
epistemic representation
comment: 16 pages; 8 figures, 4 tables; v2 changelog: Fixed the modeling for
table 3, random effect is the model version; v3 changelog: Fixed minor
formatting issues in tables 2 and 3; v4 changelog: Fixed some typos and model
description
♻ ☆ Incentivizing LLMs to Self-Verify Their Answers
Large Language Models (LLMs) have demonstrated remarkable progress in complex
reasoning tasks through both post-training and test-time scaling laws. While
prevalent test-time scaling approaches are often realized by using external
reward models to guide the model generation process, we find that only marginal
gains can be acquired when scaling a model post-trained on specific reasoning
tasks. We identify that the limited improvement stems from distribution
discrepancies between the specific post-trained generator and the general
reward model. To address this, we propose a framework that incentivizes LLMs to
self-verify their own answers. By unifying answer generation and verification
within a single reinforcement learning (RL) process, we train models that can
effectively assess the correctness of their own solutions. The trained model
can further scale its performance at inference time by verifying its
generations, without the need for external verifiers. We train our
self-verification models based on Qwen2.5-Math-7B and
DeepSeek-R1-Distill-Qwen-1.5B, demonstrating their capabilities across varying
reasoning context lengths. Experiments on multiple mathematical reasoning
benchmarks show that our models can not only improve post-training performance
but also enable effective test-time scaling.
♻ ☆ Human-Like Goalkeeping in a Realistic Football Simulation: a Sample-Efficient Reinforcement Learning Approach
Alessandro Sestini, Joakim Bergdahl, Jean-Philippe Barrette-LaPierre, Florian Fuchs, Brady Chen, Michael Jones, Linus Gisslén
While several high profile video games have served as testbeds for Deep
Reinforcement Learning (DRL), this technique has rarely been employed by the
game industry for crafting authentic AI behaviors. Previous research focuses on
training super-human agents with large models, which is impractical for game
studios with limited resources aiming for human-like agents. This paper
proposes a sample-efficient DRL method tailored for training and fine-tuning
agents in industrial settings such as the video game industry. Our method
improves sample efficiency of value-based DRL by leveraging pre-collected data
and increasing network plasticity. We evaluate our method training a goalkeeper
agent in EA SPORTS FC 25, one of the best-selling football simulations today.
Our agent outperforms the game's built-in AI by 10% in ball saving rate.
Ablation studies show that our method trains agents 50% faster compared to
standard DRL methods. Finally, qualitative evaluation from domain experts
indicates that our approach creates more human-like gameplay compared to
hand-crafted agents. As a testament to the impact of the approach, the method
has been adopted for use in the most recent release of the series.
♻ ☆ A mathematical certification for positivity conditions in Neural Networks with applications to partial monotonicity and Trustworthy AI
Artificial Neural Networks (ANNs) have become a powerful tool for modeling
complex relationships in large-scale datasets. However, their black-box nature
poses trustworthiness challenges. In certain situations, ensuring trust in
predictions might require following specific partial monotonicity constraints.
However, certifying if an already-trained ANN is partially monotonic is
challenging. Therefore, ANNs are often disregarded in some critical
applications, such as credit scoring, where partial monotonicity is required.
To address this challenge, this paper presents a novel algorithm (LipVor) that
certifies if a black-box model, such as an ANN, is positive based on a finite
number of evaluations. Consequently, since partial monotonicity can be
expressed as a positivity condition on partial derivatives, LipVor can certify
whether an ANN is partially monotonic. To do so, for every positively evaluated
point, the Lipschitzianity of the black-box model is used to construct a
specific neighborhood where the function remains positive. Next, based on the
Voronoi diagram of the evaluated points, a sufficient condition is stated to
certify if the function is positive in the domain. Unlike prior methods, our
approach certifies partial monotonicity without constrained architectures or
piece-wise linear activations. Therefore, LipVor could open up the possibility
of using unconstrained ANN in some critical fields. Moreover, some other
properties of an ANN, such as convexity, can be posed as positivity conditions,
and therefore, LipVor could also be applied.
comment: 16 pages, 4 figures
♻ ☆ Collab-REC: An LLM-based Agentic Framework for Balancing Recommendations in Tourism
We propose Collab-REC, a multi-agent framework designed to counteract
popularity bias and enhance diversity in tourism recommendations. In our
setting, three LLM-based agents -- Personalization, Popularity, and
Sustainability generate city suggestions from complementary perspectives. A
non-LLM moderator then merges and refines these proposals via multi-round
negotiation, ensuring each agent's viewpoint is incorporated while penalizing
spurious or repeated responses. Experiments on European city queries show that
Collab-REC improves diversity and overall relevance compared to a single-agent
baseline, surfacing lesser-visited locales that often remain overlooked. This
balanced, context-aware approach addresses over-tourism and better aligns with
constraints provided by the user, highlighting the promise of multi-stakeholder
collaboration in LLM-driven recommender systems.
♻ ☆ UV-Attack: Physical-World Adversarial Attacks for Person Detection via Dynamic-NeRF-based UV Mapping ICLR2025
In recent research, adversarial attacks on person detectors using patches or
static 3D model-based texture modifications have struggled with low success
rates due to the flexible nature of human movement. Modeling the 3D
deformations caused by various actions has been a major challenge. Fortunately,
advancements in Neural Radiance Fields (NeRF) for dynamic human modeling offer
new possibilities. In this paper, we introduce UV-Attack, a groundbreaking
approach that achieves high success rates even with extensive and unseen human
actions. We address the challenge above by leveraging dynamic-NeRF-based UV
mapping. UV-Attack can generate human images across diverse actions and
viewpoints, and even create novel actions by sampling from the SMPL parameter
space. While dynamic NeRF models are capable of modeling human bodies,
modifying clothing textures is challenging because they are embedded in neural
network parameters. To tackle this, UV-Attack generates UV maps instead of RGB
images and modifies the texture stacks. This approach enables real-time texture
edits and makes the attack more practical. We also propose a novel Expectation
over Pose Transformation loss (EoPT) to improve the evasion success rate on
unseen poses and views. Our experiments show that UV-Attack achieves a 92.7%
attack success rate against the FastRCNN model across varied poses in dynamic
video settings, significantly outperforming the state-of-the-art AdvCamou
attack, which only had a 28.5% ASR. Moreover, we achieve 49.5% ASR on the
latest YOLOv8 detector in black-box settings. This work highlights the
potential of dynamic NeRF-based UV mapping for creating more effective
adversarial attacks on person detectors, addressing key challenges in modeling
human movement and texture modification. The code is available at
https://github.com/PolyLiYJ/UV-Attack.
comment: 23 pages, 22 figures, accepted by ICLR2025
♻ ☆ Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling NeurIPS 2025
Test-time scaling (TTS) has proven effective in enhancing the reasoning
capabilities of large language models (LLMs). Verification plays a key role in
TTS, simultaneously influencing (1) reasoning performance and (2) compute
efficiency, due to the quality and computational cost of verification. In this
work, we challenge the conventional paradigms of verification, and make the
first attempt toward systematically investigating the impact of verification
granularity-that is, how frequently the verifier is invoked during generation,
beyond verifying only the final output or individual generation steps. To this
end, we introduce Variable Granularity Search (VG-Search), a unified algorithm
that generalizes beam search and Best-of-N sampling via a tunable granularity
parameter g. Extensive experiments with VG-Search under varying compute
budgets, generator-verifier configurations, and task attributes reveal that
dynamically selecting g can improve the compute efficiency and scaling
behavior. Building on these findings, we propose adaptive VG-Search strategies
that achieve accuracy gains of up to 3.1\% over Beam Search and 3.6\% over
Best-of-N, while reducing FLOPs by over 52\%. We will open-source the code to
support future research.
comment: Accepted at NeurIPS 2025
♻ ☆ More of the Same: Persistent Representational Harms Under Increased Representation NeurIPS 2025
To recognize and mitigate the harms of generative AI systems, it is crucial
to consider who is represented in the outputs of generative AI systems and how
people are represented. A critical gap emerges when naively improving who is
represented, as this does not imply bias mitigation efforts have been applied
to address how people are represented. We critically examined this by
investigating gender representation in occupation across state-of-the-art large
language models. We first show evidence suggesting that over time there have
been interventions to models altering the resulting gender distribution, and we
find that women are more represented than men when models are prompted to
generate biographies or personas. We then demonstrate that representational
biases persist in how different genders are represented by examining
statistically significant word differences across genders. This results in a
proliferation of representational harms, stereotypes, and neoliberalism ideals
that, despite existing interventions to increase female representation,
reinforce existing systems of oppression.
comment: Accepted by the 39th Conference on Neural Information Processing
Systems (NeurIPS 2025) as a poster paper; 39 pages, 7 figures, 15 tables
♻ ☆ VerifIoU -- Robustness of Object Detection to Perturbations SC
Noémie Cohen, Mélanie Ducoffe, Ryma Boumazouza, Christophe Gabreau, Claire Pagetti, Xavier Pucel, Audrey Galametz
We introduce a novel Interval Bound Propagation (IBP) approach for the formal
verification of object detection models, specifically targeting the
Intersection over Union (IoU) metric. The approach has been implemented in an
open source code, named IBP IoU, compatible with popular abstract
interpretation based verification tools. The resulting verifier is evaluated on
landing approach runway detection and handwritten digit recognition case
studies. Comparisons against a baseline (Vanilla IBP IoU) highlight the
superior performance of IBP IoU in ensuring accuracy and stability,
contributing to more secure and robust machine learning applications.
comment: 44th Digital Avionics Systems Conference (DASC), Sep 2025, Montreal,
Canada
♻ ☆ MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks NeurIPS 2025
Yinghao Zhu, Ziyi He, Haoran Hu, Xiaochen Zheng, Xichen Zhang, Zixiang Wang, Junyi Gao, Liantao Ma, Lequan Yu
The rapid advancement of Large Language Models (LLMs) has stimulated interest
in multi-agent collaboration for addressing complex medical tasks. However, the
practical advantages of multi-agent collaboration approaches remain
insufficiently understood. Existing evaluations often lack generalizability,
failing to cover diverse tasks reflective of real-world clinical practice, and
frequently omit rigorous comparisons against both single-LLM-based and
established conventional methods. To address this critical gap, we introduce
MedAgentBoard, a comprehensive benchmark for the systematic evaluation of
multi-agent collaboration, single-LLM, and conventional approaches.
MedAgentBoard encompasses four diverse medical task categories: (1) medical
(visual) question answering, (2) lay summary generation, (3) structured
Electronic Health Record (EHR) predictive modeling, and (4) clinical workflow
automation, across text, medical images, and structured EHR data. Our extensive
experiments reveal a nuanced landscape: while multi-agent collaboration
demonstrates benefits in specific scenarios, such as enhancing task
completeness in clinical workflow automation, it does not consistently
outperform advanced single LLMs (e.g., in textual medical QA) or, critically,
specialized conventional methods that generally maintain better performance in
tasks like medical VQA and EHR-based prediction. MedAgentBoard offers a vital
resource and actionable insights, emphasizing the necessity of a task-specific,
evidence-based approach to selecting and developing AI solutions in medicine.
It underscores that the inherent complexity and overhead of multi-agent
collaboration must be carefully weighed against tangible performance gains. All
code, datasets, detailed prompts, and experimental results are open-sourced at
https://medagentboard.netlify.app/.
comment: Accepted by NeurIPS 2025 Datasets & Benchmarks Track
♻ ☆ Demystifying the Roles of LLM Layers in Retrieval, Knowledge, and Reasoning ICASSP 2025
Recent studies suggest that the deeper layers of Large Language Models (LLMs)
contribute little to representation learning and can often be removed without
significant performance loss. However, such claims are typically drawn from
narrow evaluations and may overlook important aspects of model behavior. In
this work, we present a systematic study of depth utilization across diverse
dimensions, including evaluation protocols, task categories, and model
architectures. Our analysis confirms that very deep layers are generally less
effective than earlier ones, but their contributions vary substantially with
the evaluation setting. Under likelihood-based metrics without generation,
pruning most layers preserves performance, with only the initial few being
critical. By contrast, generation-based evaluation uncovers indispensable roles
for middle and deeper layers in enabling reasoning and maintaining long-range
coherence. We further find that knowledge and retrieval are concentrated in
shallow components, whereas reasoning accuracy relies heavily on deeper layers
-- yet can be reshaped through distillation. These results highlight that depth
usage in LLMs is highly heterogeneous and context-dependent, underscoring the
need for task-, metric-, and model-aware perspectives in both interpreting and
compressing large models.
comment: ICASSP 2025
♻ ☆ StyleGuard: Preventing Text-to-Image-Model-based Style Mimicry Attacks by Style Perturbations NIPS2025
Recently, text-to-image diffusion models have been widely used for style
mimicry and personalized customization through methods such as DreamBooth and
Textual Inversion. This has raised concerns about intellectual property
protection and the generation of deceptive content. Recent studies, such as
Glaze and Anti-DreamBooth, have proposed using adversarial noise to protect
images from these attacks. However, recent purification-based methods, such as
DiffPure and Noise Upscaling, have successfully attacked these latest defenses,
showing the vulnerabilities of these methods. Moreover, present methods show
limited transferability across models, making them less effective against
unknown text-to-image models. To address these issues, we propose a novel
anti-mimicry method, StyleGuard. We propose a novel style loss that optimizes
the style-related features in the latent space to make it deviate from the
original image, which improves model-agnostic transferability. Additionally, to
enhance the perturbation's ability to bypass diffusion-based purification, we
designed a novel upscale loss that involves ensemble purifiers and upscalers
during training. Extensive experiments on the WikiArt and CelebA datasets
demonstrate that StyleGuard outperforms existing methods in robustness against
various transformations and purifications, effectively countering style mimicry
in various models. Moreover, StyleGuard is effective on different style mimicry
methods, including DreamBooth and Textual Inversion. The code is available at
https://github.com/PolyLiYJ/StyleGuard.
comment: Accepted by NIPS2025
♻ ☆ Toward a Public and Secure Generative AI: A Comparative Analysis of Open and Closed LLMs
Generative artificial intelligence (Gen AI) systems represent a critical
technology with far-reaching implications across multiple domains of society.
However, their deployment entails a range of risks and challenges that require
careful evaluation. To date, there has been a lack of comprehensive,
interdisciplinary studies offering a systematic comparison between open-source
and proprietary (closed) generative AI systems, particularly regarding their
respective advantages and drawbacks. This study aims to: i) critically evaluate
and compare the characteristics, opportunities, and challenges of open and
closed generative AI models; and ii) propose foundational elements for the
development of an Open, Public, and Safe Gen AI framework. As a methodology, we
adopted a combined approach that integrates three methods: literature review,
critical analysis, and comparative analysis. The proposed framework outlines
key dimensions, openness, public governance, and security, as essential pillars
for shaping the future of trustworthy and inclusive Gen AI. Our findings reveal
that open models offer greater transparency, auditability, and flexibility,
enabling independent scrutiny and bias mitigation. In contrast, closed systems
often provide better technical support and ease of implementation, but at the
cost of unequal access, accountability, and ethical oversight. The research
also highlights the importance of multi-stakeholder governance, environmental
sustainability, and regulatory frameworks in ensuring responsible development.
♻ ☆ Monopoly Deal: A Benchmark Environment for Bounded One-Sided Response Games
Card games are widely used to study sequential decision-making under
uncertainty, with real-world analogues in negotiation, finance, and
cybersecurity. These games typically fall into three categories based on the
flow of control: strictly sequential (players alternate single actions),
deterministic response (some actions trigger a fixed outcome), and unbounded
reciprocal response (alternating counterplays are permitted). A less-explored
but strategically rich structure is the bounded one-sided response, where a
player's action briefly transfers control to the opponent, who must satisfy a
fixed condition through one or more moves before the turn resolves. We term
games featuring this mechanism Bounded One-Sided Response Games (BORGs). We
introduce a modified version of Monopoly Deal as a benchmark environment that
isolates this dynamic, where a Rent action forces the opponent to choose
payment assets. The gold-standard algorithm, Counterfactual Regret Minimization
(CFR), converges on effective strategies without novel algorithmic extensions.
A lightweight full-stack research platform unifies the environment, a
parallelized CFR runtime, and a human-playable web interface. The trained CFR
agent and source code are available at https://monopolydeal.ai.
comment: 24 pages, 7 figures
♻ ☆ Lost in Tokenization: Context as the Key to Unlocking Biomolecular Understanding in Scientific LLMs
Kai Zhuang, Jiawei Zhang, Yumou Liu, Hanqun Cao, Chunbin Gu, Mengdi Liu, Zhangyang Gao, Zitong Jerry Wang, Xuanhe Zhou, Pheng-Ann Heng, Lijun Wu, Conghui He, Cheng Tan
Scientific Large Language Models (Sci-LLMs) have emerged as a promising
frontier for accelerating biological discovery. However, these models face a
fundamental challenge when processing raw biomolecular sequences: the
tokenization dilemma. Whether treating sequences as a specialized language,
risking the loss of functional motif information, or as a separate modality,
introducing formidable alignment challenges, current strategies fundamentally
limit their reasoning capacity. We challenge this sequence-centric paradigm by
positing that a more effective strategy is to provide Sci-LLMs with high-level
structured context derived from established bioinformatics tools, thereby
bypassing the need to interpret low-level noisy sequence data directly. Through
a systematic comparison of leading Sci-LLMs on biological reasoning tasks, we
tested three input modes: sequence-only, context-only, and a combination of
both. Our findings are striking: the context-only approach consistently and
substantially outperforms all other modes. Even more revealing, the inclusion
of the raw sequence alongside its high-level context consistently degrades
performance, indicating that raw sequences act as informational noise, even for
models with specialized tokenization schemes. These results suggest that the
primary strength of existing Sci-LLMs lies not in their nascent ability to
interpret biomolecular syntax from scratch, but in their profound capacity for
reasoning over structured, human-readable knowledge. Therefore, we argue for
reframing Sci-LLMs not as sequence decoders, but as powerful reasoning engines
over expert knowledge. This work lays the foundation for a new class of hybrid
scientific AI agents, repositioning the developmental focus from direct
sequence interpretation towards high-level knowledge synthesis. The code is
available at https://github.com/opendatalab-raiser/CoKE.
comment: 38 pages, under review
♻ ☆ In Defence of Post-hoc Explainability NeurIPS 2024
This position paper defends post-hoc explainability methods as legitimate
tools for scientific knowledge production in machine learning. Addressing
criticism of these methods' reliability and epistemic status, we develop a
philosophical framework grounded in mediated understanding and bounded
factivity. We argue that scientific insights can emerge through structured
interpretation of model behaviour without requiring complete mechanistic
transparency, provided explanations acknowledge their approximative nature and
undergo rigorous empirical validation. Through analysis of recent biomedical ML
applications, we demonstrate how post-hoc methods, when properly integrated
into scientific practice, generate novel hypotheses and advance phenomenal
understanding.
comment: v1 presented at the Interpretable AI: Past, Present, and Future
Workshop at NeurIPS 2024 (non-archival)
♻ ☆ On-the-Fly OVD Adaptation with FLAME: Few-shot Localization via Active Marginal-Samples Exploration
Yehonathan Refael, Amit Aides, Aviad Barzilai, George Leifman, Genady Beryozkin, Vered Silverman, Bolous Jaber, Tomer Shekel
Open-vocabulary object detection (OVD) models offer remarkable flexibility by
detecting objects from arbitrary text queries. However, their zero-shot
performance in specialized domains like Remote Sensing (RS) is often
compromised by the inherent ambiguity of natural language, limiting critical
downstream applications. For instance, an OVD model may struggle to distinguish
between fine-grained classes such as "fishing boat" and "yacht" since their
embeddings are similar and often inseparable. This can hamper specific user
goals, such as monitoring illegal fishing, by producing irrelevant detections.
To address this, we propose a cascaded approach that couples the broad
generalization of a large pre-trained OVD model with a lightweight few-shot
classifier. Our method first employs the zero-shot model to generate
high-recall object proposals. These proposals are then refined for high
precision by a compact classifier trained in real-time on only a handful of
user-annotated examples - drastically reducing the high costs of RS imagery
annotation.The core of our framework is FLAME, a one-step active learning
strategy that selects the most informative samples for training. FLAME
identifies, on the fly, uncertain marginal candidates near the decision
boundary using density estimation, followed by clustering to ensure sample
diversity. This efficient sampling technique achieves high accuracy without
costly full-model fine-tuning and enables instant adaptation, within less then
a minute, which is significantly faster than state-of-the-art alternatives.Our
method consistently surpasses state-of-the-art performance on RS benchmarks,
establishing a practical and resource-efficient framework for adapting
foundation models to specific user needs.
♻ ☆ Tunable-Generalization Diffusion Powered by Self-Supervised Contextual Sub-Data for Low-Dose CT Reconstruction
Current models based on deep learning for low-dose CT denoising rely heavily
on paired data and generalize poorly. Even the more concerned diffusion models
need to learn the distribution of clean data for reconstruction, which is
difficult to satisfy in medical clinical applications. At the same time,
self-supervised-based methods face the challenge of significant degradation of
generalizability of models pre-trained for the current dose to expand to other
doses. To address these issues, this work proposes a novel method of
TUnable-geneRalizatioN Diffusion (TurnDiff) powered by self-supervised
contextual sub-data for low-dose CT reconstruction. Firstly, a contextual
subdata self-enhancing similarity strategy is designed for denoising centered
on the LDCT projection domain, which provides an initial prior for the
subsequent progress. Subsequently, the initial prior is used to combine
knowledge distillation with a deep combination of latent diffusion models for
optimizing image details. The pre-trained model is used for inference
reconstruction, and the pixel-level self-correcting fusion technique is
proposed for fine-grained reconstruction of the image domain to enhance the
image fidelity, using the initial prior and the LDCT image as a guide. In
addition, the technique is flexibly applied to the generalization of upper and
lower doses or even unseen doses. Dual-domain strategy cascade for
self-supervised LDCT denoising, TurnDiff requires only LDCT projection domain
data for training and testing. Comprehensive evaluation on both benchmark
datasets and real-world data demonstrates that TurnDiff consistently
outperforms state-of-the-art methods in both reconstruction and generalization.
♻ ☆ Towards a Method for Synthetic Generation of Persons with Aphasia Transcripts
In aphasia research, Speech-Language Pathologists (SLPs) devote extensive
time to manually coding speech samples using Correct Information Units (CIUs),
a measure of how informative an individual sample of speech is. Developing
automated systems to recognize aphasic language is limited by data scarcity.
For example, only about 600 transcripts are available in AphasiaBank yet
billions of tokens are used to train large language models (LLMs). In the
broader field of machine learning (ML), researchers increasingly turn to
synthetic data when such are sparse. Therefore, this study constructs and
validates two methods to generate synthetic transcripts of the AphasiaBank Cat
Rescue picture description task. One method leverages a procedural programming
approach while the second uses Mistral 7b Instruct and Llama 3.1 8b Instruct
LLMs. The methods generate transcripts across four severity levels (Mild,
Moderate, Severe, Very Severe) through word dropping, filler insertion, and
paraphasia substitution. Overall, we found, compared to human-elicited
transcripts, Mistral 7b Instruct best captures key aspects of linguistic
degradation observed in aphasia, showing realistic directional changes in NDW,
word count, and word length amongst the synthetic generation methods. Based on
the results, future work should plan to create a larger dataset, fine-tune
models for better aphasic representation, and have SLPs assess the realism and
usefulness of the synthetic transcripts.
comment: 19 pages, 1 figure, 7 tables
♻ ☆ Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
Academic poster generation is a crucial yet challenging task in scientific
communication, requiring the compression of long-context interleaved documents
into a single, visually coherent page. To address this challenge, we introduce
the first benchmark and metric suite for poster generation, which pairs recent
conference papers with author-designed posters and evaluates outputs on
(i)Visual Quality-semantic alignment with human posters, (ii)Textual
Coherence-language fluency, (iii)Holistic Assessment-six fine-grained aesthetic
and informational criteria scored by a VLM-as-judge, and notably
(iv)PaperQuiz-the poster's ability to convey core paper content as measured by
VLMs answering generated quizzes. Building on this benchmark, we propose
PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a)Parser
distills the paper into a structured asset library; the (b)Planner aligns
text-visual pairs into a binary-tree layout that preserves reading order and
spatial balance; and the (c)Painter-Commenter loop refines each panel by
executing rendering code and using VLM feedback to eliminate overflow and
ensure alignment. In our comprehensive evaluation, we find that GPT-4o
outputs-though visually appealing at first glance-often exhibit noisy text and
poor PaperQuiz scores, and we find that reader engagement is the primary
aesthetic bottleneck, as human-designed posters rely largely on visual
semantics to convey meaning. Our fully open-source variants (e.g. based on the
Qwen-2.5 series) outperform existing 4o-driven multi-agent systems across
nearly all metrics, while using 87% fewer tokens. It transforms a 22-page paper
into a finalized yet editable .pptx poster - all for just $0.005. These
findings chart clear directions for the next generation of fully automated
poster-generation models. The code and datasets are available at
https://github.com/Paper2Poster/Paper2Poster.
comment: Project Page: https://github.com/Paper2Poster/Paper2Poster
♻ ☆ BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains
Vijay Devane, Mohd Nauman, Bhargav Patel, Aniket Mahendra Wakchoure, Yogeshkumar Sant, Shyam Pawar, Viraj Thakur, Ananya Godse, Sunil Patra, Neha Maurya, Suraj Racha, Nitish Kamal Singh, Ajay Nagpal, Piyush Sawarkar, Kundeshwar Vijayrao Pundalik, Rohit Saluja, Ganesh Ramakrishnan
The rapid advancement of large language models(LLMs) has intensified the need
for domain and culture specific evaluation. Existing benchmarks are largely
Anglocentric and domain-agnostic, limiting their applicability to India-centric
contexts. To address this gap, we introduce BhashaBench V1, the first
domain-specific, multi-task, bilingual benchmark focusing on critical Indic
knowledge systems. BhashaBench V1 contains 74,166 meticulously curated
question-answer pairs, with 52,494 in English and 21,672 in Hindi, sourced from
authentic government and domain-specific exams. It spans four major domains:
Agriculture, Legal, Finance, and Ayurveda, comprising 90+ subdomains and
covering 500+ topics, enabling fine-grained evaluation. Evaluation of 29+ LLMs
reveals significant domain and language specific performance gaps, with
especially large disparities in low-resource domains. For instance, GPT-4o
achieves 76.49% overall accuracy in Legal but only 59.74% in Ayurveda. Models
consistently perform better on English content compared to Hindi across all
domains. Subdomain-level analysis shows that areas such as Cyber Law,
International Finance perform relatively well, while Panchakarma, Seed Science,
and Human Rights remain notably weak. BhashaBench V1 provides a comprehensive
dataset for evaluating large language models across India's diverse knowledge
domains. It enables assessment of models' ability to integrate domain-specific
knowledge with bilingual understanding. All code, benchmarks, and resources are
publicly available to support open research.
♻ ☆ MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning? NeurIPS'25
Large foundation models face challenges in acquiring transferable, structured
thinking abilities, especially when supervised with rigid templates or
crowd-annotated instruction datasets. Unlike prior approaches, we focus on a
thinking-centric data synthesis paradigm that enables models to evolve through
self-generated, cognitively guided data. We propose MindGYM, a structured and
scalable framework for question synthesis, composed of: (1) Cognitive Thinking
Process Injection, which infuses high-level reasoning objectives to shape the
model's synthesis behavior; (2) Seed Single-Hop Question Synthesis, generating
atomic questions from diverse semantic types to encourage broader thinking; and
(3) Challenging Multi-Hop QA Synthesis, composing more complex multi-hop
questions based on QA seeds for deeper reasoning. Detailed analysis shows that
synthetic data generated by our method achieves 16.7% higher average quality
and 67.91% lower quality variance compared to baseline sources, highlighting
that both high-quality and self-contained data are essential for effective,
thinking-oriented fine-tuning. MindGYM improves performance on six reasoning
benchmarks, achieving gains of up to 16% on MathVision using only 400 data
samples, and generalizable improvements across different model sizes and
architectures. MindGYM underscores the viability of self-challenging mechanisms
in refining large model capabilities while minimizing human intervention and
resource demands. Code and data are released to promote data-centric research
into self-evolving foundation models driven by their internal reasoning
capabilities.
comment: Accepted by NeurIPS'25. 30 pages, 2 figures, 13 tables
♻ ☆ AIMeter: Measuring, Analyzing, and Visualizing Energy and Carbon Footprint of AI Workloads
The rapid advancement of AI, particularly large language models (LLMs), has
raised significant concerns about the energy use and carbon emissions
associated with model training and inference. However, existing tools for
measuring and reporting such impacts are often fragmented, lacking systematic
metric integration and offering limited support for correlation analysis among
them. This paper presents AIMeter, a comprehensive software toolkit for the
measurement, analysis, and visualization of energy use, power draw, hardware
performance, and carbon emissions across AI workloads. By seamlessly
integrating with existing AI frameworks, AIMeter offers standardized reports
and exports fine-grained time-series data to support benchmarking and
reproducibility in a lightweight manner. It further enables in-depth
correlation analysis between hardware metrics and model performance and thus
facilitates bottleneck identification and performance enhancement. By
addressing critical limitations in existing tools, AIMeter encourages the
research community to weigh environmental impact alongside raw performance of
AI workloads and advances the shift toward more sustainable "Green AI"
practices. The code is available at https://github.com/SusCom-Lab/AIMeter.
comment: 11 pages, 7 figures and 5 tables
♻ ☆ UNO-Bench: A Unified Benchmark for Exploring the Compositional Law Between Uni-modal and Omni-modal in Omni Models
Chen Chen, ZeYang Hu, Fengjiao Chen, Liya Ma, Jiaxing Liu, Xiaoyu Li, Ziwen Wang, Xuezhi Cao, Xunliang Cai
Multimodal Large Languages models have been progressing from uni-modal
understanding toward unifying visual, audio and language modalities,
collectively termed omni models. However, the correlation between uni-modal and
omni-modal remains unclear, which requires comprehensive evaluation to drive
omni model's intelligence evolution. In this work, we introduce a novel,
high-quality, and UNified Omni model benchmark, UNO-Bench. This benchmark is
designed to effectively evaluate both UNi-modal and Omni-modal capabilities
under a unified ability taxonomy, spanning 44 task types and 5 modality
combinations. It includes 1250 human curated samples for omni-modal with 98%
cross-modality solvability, and 2480 enhanced uni-modal samples. The
human-generated dataset is well-suited to real-world scenarios, particularly
within the Chinese context, whereas the automatically compressed dataset offers
a 90% increase in speed and maintains 98% consistency across 18 public
benchmarks. In addition to traditional multi-choice questions, we propose an
innovative multi-step open-ended question format to assess complex reasoning. A
general scoring model is incorporated, supporting 6 question types for
automated evaluation with 95% accuracy. Experimental result shows the
Compositional Law between omni-modal and uni-modal performance and the
omni-modal capability manifests as a bottleneck effect on weak models, while
exhibiting synergistic promotion on strong models.
comment: v3: Switch the paper template. Work in progress. Github:
https://github.com/meituan-longcat/UNO-Bench Hugging Face:
https://huggingface.co/datasets/meituan-longcat/UNO-Bench
♻ ☆ Reflection on Data Storytelling Tools in the Generative AI Era from the Human-AI Collaboration Perspective IEEE VIS 25
Human-AI collaborative tools attract attentions from the data storytelling
community to lower the expertise barrier and streamline the workflow. The
recent advance in large-scale generative AI techniques, e.g., large language
models (LLMs) and text-to-image models, has the potential to enhance data
storytelling with their power in visual and narration generation. After two
years since these techniques were publicly available, it is important to
reflect our progress of applying them and have an outlook for future
opportunities. To achieve the goal, we compare the collaboration patterns of
the latest tools with those of earlier ones using a dedicated framework for
understanding human-AI collaboration in data storytelling. Through comparison,
we identify consistently widely studied patterns, e.g., human-creator +
AI-assistant, and newly explored or emerging ones, e.g., AI-creator +
human-reviewer. The benefits of these AI techniques and implications to
human-AI collaboration are also revealed. We further propose future directions
to hopefully ignite innovations.
comment: This paper is a sequel to the CHI 24 paper "Where Are We So Far?
Understanding Data Storytelling Tools from the Perspective of Human-AI
Collaboration (https://doi.org/10.1145/3613904.3642726), aiming to refresh
our understanding with the latest advancements. It is accepted at IEEE VIS 25
♻ ☆ Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data NeurIPS'25
Fine-tuning large language models (LLMs) using diverse datasets is crucial
for enhancing their overall performance across various domains. In practical
scenarios, existing methods based on modeling the mixture proportions of data
composition often struggle with data whose domain labels are missing, imprecise
or non-normalized, while methods based on data selection usually encounter
difficulties in balancing multi-domain performance. To address these
challenges, in this work, we investigate the role of data diversity in
enhancing the overall abilities of LLMs by empirically constructing contrastive
data pools and theoretically deriving explanations. Building upon the insights
gained, we propose a new method that gives the LLM a dual identity: an output
model to cognitively probe and select data based on diversity reward, as well
as an input model to be tuned with the selected data. Extensive experiments
show that the proposed method notably boosts performance across
domain-undetermined data and a series of foundational downstream tasks when
applied to various advanced LLMs. We release our code and hope this study can
shed light on the understanding of data diversity and advance feedback-driven
data-model co-design for LLMs.
comment: Accepted by NeurIPS'25 main track. 47 pages, 21 figures, 32 tables
♻ ☆ Model-Document Protocol for AI Search
AI search depends on linking large language models (LLMs) with vast external
knowledge sources. Yet web pages, PDF files, and other raw documents are not
inherently LLM-ready: they are long, noisy, and unstructured. Conventional
retrieval methods treat these documents as verbatim text and return raw
passages, leaving the burden of fragment assembly and contextual reasoning to
the LLM. This gap underscores the need for a new retrieval paradigm that
redefines how models interact with documents.
We introduce the Model-Document Protocol (MDP), a general framework that
formalizes how raw text is bridged to LLMs through consumable knowledge
representations. Rather than treating retrieval as passage fetching, MDP
defines multiple pathways that transform unstructured documents into
task-specific, LLM-ready inputs. These include agentic reasoning, which curates
raw evidence into coherent context; memory grounding, which accumulates
reusable notes to enrich reasoning; and structured leveraging, which encodes
documents into formal representations such as graphs or key-value caches. All
three pathways share the same goal: ensuring that what reaches the LLM is not
raw fragments but compact, structured knowledge directly consumable for
reasoning.
As an instantiation, we present MDP-Agent, which realizes the protocol
through an agentic process: constructing document-level gist memories for
global coverage, performing diffusion-based exploration with vertical
exploitation to uncover layered dependencies, and applying map-reduce style
synthesis to integrate large-scale evidence into compact yet sufficient
context. Experiments on information-seeking benchmarks demonstrate that
MDP-Agent outperforms baselines, validating both the soundness of the MDP
framework and the effectiveness of its agentic instantiation.
comment: 10 pages
♻ ☆ Chaos-based reinforcement learning with TD3
Chaos-based reinforcement learning (CBRL) is a method in which the agent's
internal chaotic dynamics drives exploration. However, the learning algorithms
in CBRL have not been thoroughly developed in previous studies, nor have they
incorporated recent advances in reinforcement learning. This study introduced
Twin Delayed Deep Deterministic Policy Gradients (TD3), which is one of the
state-of-the-art deep reinforcement learning algorithms that can treat
deterministic and continuous action spaces, to CBRL. The validation results
provide several insights. First, TD3 works as a learning algorithm for CBRL in
a simple goal-reaching task. Second, CBRL agents with TD3 can autonomously
suppress their exploratory behavior as learning progresses and resume
exploration when the environment changes. Finally, examining the effect of the
agent's chaoticity on learning shows that there exists a suitable range of
chaos strength in the agent's model to flexibly switch between exploration and
exploitation and adapt to environmental changes.
comment: Accepted for publication in Neural Networks
♻ ☆ Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training NeurIPS 2025
As both model and dataset sizes continue to scale rapidly, conventional
pretraining strategies with fixed compute budgets-such as cosine learning rate
schedules-are increasingly inadequate for large-scale training. Recent
alternatives, including warmup-stable-decay (WSD) schedules and weight
averaging, offer greater flexibility. However, WSD relies on explicit decay
phases to track progress, while weight averaging addresses this limitation at
the cost of additional memory. In search of a more principled and scalable
alternative, we revisit the Schedule-Free (SF) method [Defazio et al., 2024],
which has shown strong empirical performance across diverse settings. We show
that SF-AdamW effectively navigates the "river" structure of the loss landscape
without decay phases or auxiliary averaging, making it particularly suitable
for continuously scaling training workloads. To understand this behavior, we
conduct a theoretical and empirical analysis of SF dynamics, revealing that it
implicitly performs weight averaging without memory overhead. Guided by this
analysis, we propose a refined variant of SF that improves robustness to
momentum and performs better under large batch sizes, addressing key
limitations of the original method. Together, these results establish SF as a
practical, scalable, and theoretically grounded approach for language model
training.
comment: Published at NeurIPS 2025
♻ ☆ A Convexity-dependent Two-Phase Training Algorithm for Deep Neural Networks
The key task of machine learning is to minimize the loss function that
measures the model fit to the training data. The numerical methods to do this
efficiently depend on the properties of the loss function. The most decisive
among these properties is the convexity or non-convexity of the loss function.
The fact that the loss function can have, and frequently has, non-convex
regions has led to a widespread commitment to non-convex methods such as Adam.
However, a local minimum implies that, in some environment around it, the
function is convex. In this environment, second-order minimizing methods such
as the Conjugate Gradient (CG) give a guaranteed superlinear convergence. We
propose a novel framework grounded in the hypothesis that loss functions in
real-world tasks swap from initial non-convexity to convexity towards the
optimum. This is a property we leverage to design an innovative two-phase
optimization algorithm. The presented algorithm detects the swap point by
observing the gradient norm dependence on the loss. In these regions,
non-convex (Adam) and convex (CG) algorithms are used, respectively. Computing
experiments confirm the hypothesis that this simple convexity structure is
frequent enough to be practically exploited to substantially improve
convergence and accuracy.
comment: Appeared on KDIR IC3K Conference 2025 (Best Paper Award). Published
in "Proceedings of the 17th International Joint Conference on Knowledge
Discovery, Knowledge Engineering and Knowledge Management - Volume 1"
♻ ☆ FASL-Seg: Anatomy and Tool Segmentation of Surgical Scenes ECAI
Muraam Abdel-Ghani, Mahmoud Ali, Mohamed Ali, Fatmaelzahraa Ahmed, Muhammad Arsalan, Abdulaziz Al-Ali, Shidin Balakrishnan
The growing popularity of robotic minimally invasive surgeries has made deep
learning-based surgical training a key area of research. A thorough
understanding of the surgical scene components is crucial, which semantic
segmentation models can help achieve. However, most existing work focuses on
surgical tools and overlooks anatomical objects. Additionally, current
state-of-the-art (SOTA) models struggle to balance capturing high-level
contextual features and low-level edge features. We propose a Feature-Adaptive
Spatial Localization model (FASL-Seg), designed to capture features at multiple
levels of detail through two distinct processing streams, namely a Low-Level
Feature Projection (LLFP) and a High-Level Feature Projection (HLFP) stream,
for varying feature resolutions - enabling precise segmentation of anatomy and
surgical instruments. We evaluated FASL-Seg on surgical segmentation benchmark
datasets EndoVis18 and EndoVis17 on three use cases. The FASL-Seg model
achieves a mean Intersection over Union (mIoU) of 72.71% on parts and anatomy
segmentation in EndoVis18, improving on SOTA by 5%. It further achieves a mIoU
of 85.61% and 72.78% in EndoVis18 and EndoVis17 tool type segmentation,
respectively, outperforming SOTA overall performance, with comparable per-class
SOTA results in both datasets and consistent performance in various classes for
anatomy and instruments, demonstrating the effectiveness of distinct processing
streams for varying feature resolutions.
comment: 8 pages, 6 figures, In Proceedings of European Conference on
Artificial Intelligence (ECAI) 2025
♻ ☆ Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation
Fangyuan Mao, Aiming Hao, Jintao Chen, Dongxia Liu, Xiaokun Feng, Jiashu Zhu, Meiqi Wu, Chubin Chen, Jiahong Wu, Xiangxiang Chu
Visual effects (VFX) are essential visual enhancements fundamental to modern
cinematic production. Although video generation models offer cost-efficient
solutions for VFX production, current methods are constrained by per-effect
LoRA training, which limits generation to single effects. This fundamental
limitation impedes applications that require spatially controllable composite
effects, i.e., the concurrent generation of multiple effects at designated
locations. However, integrating diverse effects into a unified framework faces
major challenges: interference from effect variations and spatial
uncontrollability during multi-VFX joint training. To tackle these challenges,
we propose Omni-Effects, a first unified framework capable of generating
prompt-guided effects and spatially controllable composite effects. The core of
our framework comprises two key innovations: (1) LoRA-based Mixture of Experts
(LoRA-MoE), which employs a group of expert LoRAs, integrating diverse effects
within a unified model while effectively mitigating cross-task interference.
(2) Spatial-Aware Prompt (SAP) incorporates spatial mask information into the
text token, enabling precise spatial control. Furthermore, we introduce an
Independent-Information Flow (IIF) module integrated within the SAP, isolating
the control signals corresponding to individual effects to prevent any unwanted
blending. To facilitate this research, we construct a comprehensive VFX dataset
Omni-VFX via a novel data collection pipeline combining image editing and
First-Last Frame-to-Video (FLF2V) synthesis, and introduce a dedicated VFX
evaluation framework for validating model performance. Extensive experiments
demonstrate that Omni-Effects achieves precise spatial control and diverse
effect generation, enabling users to specify both the category and location of
desired effects.
♻ ☆ SPARKE: Scalable Prompt-Aware Diversity and Novelty Guidance in Diffusion Models via RKE Score
Diffusion models have demonstrated remarkable success in high-fidelity image
synthesis and prompt-guided generative modeling. However, ensuring adequate
diversity in generated samples of prompt-guided diffusion models remains a
challenge, particularly when the prompts span a broad semantic spectrum and the
diversity of generated data needs to be evaluated in a prompt-aware fashion
across semantically similar prompts. Recent methods have introduced guidance
via diversity measures to encourage more varied generations. In this work, we
extend the diversity measure-based approaches by proposing the Scalable
Prompt-Aware R\'eny Kernel Entropy Diversity Guidance (SPARKE) method for
prompt-aware diversity guidance. SPARKE utilizes conditional entropy for
diversity guidance, which dynamically conditions diversity measurement on
similar prompts and enables prompt-aware diversity control. While the
entropy-based guidance approach enhances prompt-aware diversity, its reliance
on the matrix-based entropy scores poses computational challenges in
large-scale generation settings. To address this, we focus on the special case
of Conditional latent RKE Score Guidance, reducing entropy computation and
gradient-based optimization complexity from the $O(n^3)$ of general entropy
measures to $O(n)$. The reduced computational complexity allows for
diversity-guided sampling over potentially thousands of generation rounds on
different prompts. We numerically test the SPARKE method on several
text-to-image diffusion models, demonstrating that the proposed method improves
the prompt-aware diversity of the generated data without incurring significant
computational costs. We release our code on the project page:
https://mjalali.github.io/SPARKE
♻ ★ Efficient Regression-Based Training of Normalizing Flows for Boltzmann Generators ICML
Danyal Rehman, Oscar Davis, Jiarui Lu, Jian Tang, Michael Bronstein, Yoshua Bengio, Alexander Tong, Avishek Joey Bose
Simulation-free training frameworks have been at the forefront of the
generative modelling revolution in continuous spaces, leading to large-scale
diffusion and flow matching models. However, such modern generative models
suffer from expensive inference, inhibiting their use in numerous scientific
applications like Boltzmann Generators (BGs) for molecular conformations that
require fast likelihood evaluation. In this paper, we revisit classical
normalizing flows in the context of BGs that offer efficient sampling and
likelihoods, but whose training via maximum likelihood is often unstable and
computationally challenging. We propose Regression Training of Normalizing
Flows (RegFlow), a novel and scalable regression-based training objective that
bypasses the numerical instability and computational challenge of conventional
maximum likelihood training in favour of a simple $\ell_2$-regression
objective. Specifically, RegFlow maps prior samples under our flow to targets
computed using optimal transport couplings or a pre-trained continuous
normalizing flow (CNF). To enhance numerical stability, RegFlow employs
effective regularization strategies such as a new forward-backward
self-consistency loss that enjoys painless implementation. Empirically, we
demonstrate that RegFlow unlocks a broader class of architectures that were
previously intractable to train for BGs with maximum likelihood. We also show
RegFlow exceeds the performance, computational cost, and stability of maximum
likelihood training in equilibrium sampling in Cartesian coordinates of alanine
dipeptide, tripeptide, and tetrapeptide, showcasing its potential in molecular
systems.
comment: Preprint; ICML GenBio Best Paper Award 2025
♻ ☆ Nek Minit: Harnessing Pragmatic Metacognitive Prompting for Explainable Sarcasm Detection of Australian and Indian English ALT
Sarcasm is a challenge to sentiment analysis because of the incongruity
between stated and implied sentiment. The challenge is exacerbated when the
implication may be relevant to a specific country or geographical region.
Pragmatic metacognitive prompting (PMP) is a cognition-inspired technique that
has been used for pragmatic reasoning. In this paper, we harness PMP for
explainable sarcasm detection for Australian and Indian English, alongside a
benchmark dataset for standard English. We manually add sarcasm explanations to
an existing sarcasm-labeled dataset for Australian and Indian English called
BESSTIE, and compare the performance for explainable sarcasm detection for them
with FLUTE, a standard English dataset containing sarcasm explanations. Our
approach utilising PMP when evaluated on two open-weight LLMs (GEMMA and LLAMA)
achieves statistically significant performance improvement across all tasks and
datasets when compared with four alternative prompting strategies. We also find
that alternative techniques such as agentic prompting mitigate context-related
failures by enabling external knowledge retrieval. The focused contribution of
our work is utilising PMP in generating sarcasm explanations for varieties of
English.
comment: ALTA 2025 (Best Paper Honorable Mention). Camera-ready
♻ ☆ Learning to Insert for Constructive Neural Vehicle Routing Solver NeurIPS 2025
Neural Combinatorial Optimisation (NCO) is a promising learning-based
approach for solving Vehicle Routing Problems (VRPs) without extensive manual
design. While existing constructive NCO methods typically follow an
appending-based paradigm that sequentially adds unvisited nodes to partial
solutions, this rigid approach often leads to suboptimal results. To overcome
this limitation, we explore the idea of insertion-based paradigm and propose
Learning to Construct with Insertion-based Paradigm (L2C-Insert), a novel
learning-based method for constructive NCO. Unlike traditional approaches,
L2C-Insert builds solutions by strategically inserting unvisited nodes at any
valid position in the current partial solution, which can significantly enhance
the flexibility and solution quality. The proposed framework introduces three
key components: a novel model architecture for precise insertion position
prediction, an efficient training scheme for model optimization, and an
advanced inference technique that fully exploits the insertion paradigm's
flexibility. Extensive experiments on both synthetic and real-world instances
of the Travelling Salesman Problem (TSP) and Capacitated Vehicle Routing
Problem (CVRP) demonstrate that L2C-Insert consistently achieves superior
performance across various problem sizes.
comment: Accepted at NeurIPS 2025
♻ ★ Self-Evolving Curriculum for LLM Reasoning
Xiaoyin Chen, Jiarui Lu, Minsu Kim, Dinghuai Zhang, Jian Tang, Alexandre Piché, Nicolas Gontier, Yoshua Bengio, Ehsan Kamalloo
Reinforcement learning (RL) has proven effective for fine-tuning large
language models (LLMs), significantly enhancing their reasoning abilities in
domains such as mathematics and code generation. A crucial factor influencing
RL fine-tuning success is the training curriculum: the order in which training
problems are presented. While random curricula serve as common baselines, they
remain suboptimal; manually designed curricula often rely heavily on
heuristics, and online filtering methods can be computationally prohibitive. To
address these limitations, we propose Self-Evolving Curriculum (SEC), an
automatic curriculum learning method that learns a curriculum policy
concurrently with the RL fine-tuning process. Our approach formulates
curriculum selection as a non-stationary Multi-Armed Bandit problem, treating
each problem category (e.g., difficulty level or problem type) as an individual
arm. We leverage the absolute advantage from policy gradient methods as a proxy
measure for immediate learning gain. At each training step, the curriculum
policy selects categories to maximize this reward signal and is updated using
the TD(0) method. Across three distinct reasoning domains: planning, inductive
reasoning, and mathematics, our experiments demonstrate that SEC significantly
improves models' reasoning capabilities, enabling better generalization to
harder, out-of-distribution test problems. Additionally, our approach achieves
better skill balance when fine-tuning simultaneously on multiple reasoning
domains. These findings highlight SEC as a promising strategy for RL
fine-tuning of LLMs.
♻ ☆ SAFE: Multitask Failure Detection for Vision-Language-Action Models NeurIPS 2025
Qiao Gu, Yuanliang Ju, Shengxiang Sun, Igor Gilitschenski, Haruki Nishimura, Masha Itkina, Florian Shkurti
While vision-language-action models (VLAs) have shown promising robotic
behaviors across a diverse set of manipulation tasks, they achieve limited
success rates when deployed on novel tasks out of the box. To allow these
policies to safely interact with their environments, we need a failure detector
that gives a timely alert such that the robot can stop, backtrack, or ask for
help. However, existing failure detectors are trained and tested only on one or
a few specific tasks, while generalist VLAs require the detector to generalize
and detect failures also in unseen tasks and novel environments. In this paper,
we introduce the multitask failure detection problem and propose SAFE, a
failure detector for generalist robot policies such as VLAs. We analyze the VLA
feature space and find that VLAs have sufficient high-level knowledge about
task success and failure, which is generic across different tasks. Based on
this insight, we design SAFE to learn from VLA internal features and predict a
single scalar indicating the likelihood of task failure. SAFE is trained on
both successful and failed rollouts and is evaluated on unseen tasks. SAFE is
compatible with different policy architectures. We test it on OpenVLA, $\pi_0$,
and $\pi_0$-FAST in both simulated and real-world environments extensively. We
compare SAFE with diverse baselines and show that SAFE achieves
state-of-the-art failure detection performance and the best trade-off between
accuracy and detection time using conformal prediction. More qualitative
results and code can be found at the project webpage:
https://vla-safe.github.io/
comment: NeurIPS 2025 camera ready. Project Page: https://vla-safe.github.io/
♻ ☆ FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs EMNLP
The accurate trust assessment of multimodal large language models (MLLMs)
generated predictions, which can enable selective prediction and improve user
confidence, is challenging due to the diverse multi-modal input paradigms. We
propose Functionally Equivalent Sampling for Trust Assessment (FESTA), a
multimodal input sampling technique for MLLMs, that generates an uncertainty
measure based on the equivalent and complementary input samplings. The proposed
task-preserving sampling approach for uncertainty quantification expands the
input space to probe the consistency (through equivalent samples) and
sensitivity (through complementary samples) of the model. FESTA uses only
input-output access of the model (black-box), and does not require ground truth
(unsupervised). The experiments are conducted with various off-the-shelf
multi-modal LLMs, on both visual and audio reasoning tasks. The proposed FESTA
uncertainty estimate achieves significant improvement (33.3% relative
improvement for vision-LLMs and 29.6% relative improvement for audio-LLMs) in
selective prediction performance, based on
area-under-receiver-operating-characteristic curve (AUROC) metric in detecting
mispredictions. The code implementation is open-sourced.
comment: Accepted in the Findings of EMNLP, 2025
♻ ☆ Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space
Hengli Li, Chenxi Li, Tong Wu, Xuekai Zhu, Yuxuan Wang, Zhaoxin Yu, Eric Hanchen Jiang, Song-Chun Zhu, Zixia Jia, Ying Nian Wu, Zilong Zheng
Reasoning ability, a core component of human intelligence, continues to pose
a significant challenge for Large Language Models (LLMs) in the pursuit of AGI.
Although model performance has improved under the training scaling law,
significant challenges remain, particularly with respect to training
algorithms, such as catastrophic forgetting, and the limited availability of
novel training data. As an alternative, test-time scaling enhances reasoning
performance by increasing test-time computation without parameter updating.
Unlike prior methods in this paradigm focused on token space, we propose
leveraging latent space for more effective reasoning and better adherence to
the test-time scaling law. We introduce LatentSeek, a novel framework that
enhances LLM reasoning through Test-Time Instance-level Adaptation (TTIA)
within the model's latent space. Specifically, LatentSeek leverages policy
gradient to iteratively update latent representations, guided by self-generated
reward signals. LatentSeek is evaluated on a range of reasoning benchmarks,
including GSM8K, MATH-500, and AIME2024, across multiple LLM architectures.
Results show that LatentSeek consistently outperforms strong baselines, such as
Chain-of-Thought prompting and fine-tuning-based methods. Furthermore, our
analysis demonstrates that LatentSeek is highly efficient, typically converging
within a few iterations for problems of average complexity, while also
benefiting from additional iterations, thereby highlighting the potential of
test-time scaling in the latent space. These findings position LatentSeek as a
lightweight, scalable, and effective solution for enhancing the reasoning
capabilities of LLMs.
♻ ☆ The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs
Songyang Liu, Chaozhuo Li, Jiameng Qiu, Xi Zhang, Feiran Huang, Litian Zhang, Yiming Hei, Philip S. Yu
With the rapid advancement of artificial intelligence, Large Language Models
(LLMs) have shown remarkable capabilities in Natural Language Processing (NLP),
including content generation, human-computer interaction, machine translation,
and code generation. However, their widespread deployment has also raised
significant safety concerns. In particular, LLM-generated content can exhibit
unsafe behaviors such as toxicity, bias, or misinformation, especially in
adversarial contexts, which has attracted increasing attention from both
academia and industry. Although numerous studies have attempted to evaluate
these risks, a comprehensive and systematic survey on safety evaluation of LLMs
is still lacking. This work aims to fill this gap by presenting a structured
overview of recent advances in safety evaluation of LLMs. Specifically, we
propose a four-dimensional taxonomy: (i) Why to evaluate, which explores the
background of safety evaluation of LLMs, how they differ from general LLMs
evaluation, and the significance of such evaluation; (ii) What to evaluate,
which examines and categorizes existing safety evaluation tasks based on key
capabilities, including dimensions such as toxicity, robustness, ethics, bias
and fairness, truthfulness, and related aspects; (iii) Where to evaluate, which
summarizes the evaluation metrics, datasets and benchmarks currently used in
safety evaluations; (iv) How to evaluate, which reviews existing mainstream
evaluation methods based on the roles of the evaluators and some evaluation
frameworks that integrate the entire evaluation pipeline. Finally, we identify
the challenges in safety evaluation of LLMs and propose promising research
directions to promote further advancement in this field. We emphasize the
necessity of prioritizing safety evaluation to ensure the reliable and
responsible deployment of LLMs in real-world applications.
comment: 20 pages, preprint
♻ ☆ Towards Predicting Any Human Trajectory In Context NeurIPS 2025
Predicting accurate future trajectories of pedestrians is essential for
autonomous systems but remains a challenging task due to the need for
adaptability in different environments and domains. A common approach involves
collecting scenario-specific data and performing fine-tuning via
backpropagation. However, the need to fine-tune for each new scenario is often
impractical for deployment on edge devices. To address this challenge, we
introduce \paper, an In-Context Learning (ICL) framework for pedestrian
trajectory prediction that enables adaptation without fine-tuning on the
scenario-specific data at inference time without requiring weight updates. We
propose a spatio-temporal similarity-based example selection (STES) method that
selects relevant examples from previously observed trajectories within the same
scene by identifying similar motion patterns at corresponding locations. To
further refine this selection, we introduce prediction-guided example selection
(PG-ES), which selects examples based on both the past trajectory and the
predicted future trajectory, rather than relying solely on the past trajectory.
This approach allows the model to account for long-term dynamics when selecting
examples. Finally, instead of relying on small real-world datasets with limited
scenario diversity, we train our model on a large-scale synthetic dataset to
enhance its prediction ability by leveraging in-context examples. Extensive
experiments demonstrate that TrajICL achieves remarkable adaptation across both
in-domain and cross-domain scenarios, outperforming even fine-tuned approaches
across multiple public benchmarks. Project Page:
https://fujiry0.github.io/TrajICL-project-page/.
comment: NeurIPS 2025
♻ ☆ SPLite Hand: Sparsity-Aware Lightweight 3D Hand Pose Estimation
With the increasing ubiquity of AR/VR devices, the deployment of deep
learning models on edge devices has become a critical challenge. These devices
require real-time inference, low power consumption, and minimal latency. Many
framework designers face the conundrum of balancing efficiency and performance.
We design a light framework that adopts an encoder-decoder architecture and
introduces several key contributions aimed at improving both efficiency and
accuracy. We apply sparse convolution on a ResNet-18 backbone to exploit the
inherent sparsity in hand pose images, achieving a 42% end-to-end efficiency
improvement. Moreover, we propose our SPLite decoder. This new architecture
significantly boosts the decoding process's frame rate by 3.1x on the Raspberry
Pi 5, while maintaining accuracy on par. To further optimize performance, we
apply quantization-aware training, reducing memory usage while preserving
accuracy (PA-MPJPE increases only marginally from 9.0 mm to 9.1 mm on
FreiHAND). Overall, our system achieves a 2.98x speed-up on a Raspberry Pi 5
CPU (BCM2712 quad-core Arm A76 processor). Our method is also evaluated on
compound benchmark datasets, demonstrating comparable accuracy to
state-of-the-art approaches while significantly enhancing computational
efficiency.
comment: Accepted to AICCC 2025
♻ ☆ LAFA: Agentic LLM-Driven Federated Analytics over Decentralized Data Sources
Large Language Models (LLMs) have shown great promise in automating data
analytics tasks by interpreting natural language queries and generating
multi-operation execution plans. However, existing LLM-agent-based analytics
frameworks operate under the assumption of centralized data access, offering
little to no privacy protection. In contrast, federated analytics (FA) enables
privacy-preserving computation across distributed data sources, but lacks
support for natural language input and requires structured, machine-readable
queries. In this work, we present LAFA, the first system that integrates
LLM-agent-based data analytics with FA. LAFA introduces a hierarchical
multi-agent architecture that accepts natural language queries and transforms
them into optimized, executable FA workflows. A coarse-grained planner first
decomposes complex queries into sub-queries, while a fine-grained planner maps
each subquery into a Directed Acyclic Graph of FA operations using prior
structural knowledge. To improve execution efficiency, an optimizer agent
rewrites and merges multiple DAGs, eliminating redundant operations and
minimizing computational and communicational overhead. Our experiments
demonstrate that LAFA consistently outperforms baseline prompting strategies by
achieving higher execution plan success rates and reducing resource-intensive
FA operations by a substantial margin. This work establishes a practical
foundation for privacy-preserving, LLM-driven analytics that supports natural
language input in the FA setting.
comment: This paper has been accepted by the 16th IEEE International
Conference on Cloud Computing Technology and Science (CloudCom 2025)
♻ ☆ Multi-Agent Evolve: LLM Self-Improve through Co-evolution
Reinforcement Learning (RL) has demonstrated significant potential in
enhancing the reasoning capabilities of large language models (LLMs). However,
the success of RL for LLMs heavily relies on human-curated datasets and
verifiable rewards, which limit their scalability and generality. Recent
Self-Play RL methods, inspired by the success of the paradigm in games and Go,
aim to enhance LLM reasoning capabilities without human-annotated data.
However, their methods primarily depend on a grounded environment for feedback
(e.g., a Python interpreter or a game engine); extending them to general
domains remains challenging. To address these challenges, we propose
Multi-Agent Evolve (MAE), a framework that enables LLMs to self-evolve in
solving diverse tasks, including mathematics, reasoning, and general knowledge
Q&A. The core design of MAE is based on a triplet of interacting agents
(Proposer, Solver, Judge) that are instantiated from a single LLM, and applies
reinforcement learning to optimize their behaviors. The Proposer generates
questions, the Solver attempts solutions, and the Judge evaluates both while
co-evolving. Experiments on Qwen2.5-3B-Instruct demonstrate that MAE achieves
an average improvement of 4.54% on multiple benchmarks. These results highlight
MAE as a scalable, data-efficient method for enhancing the general reasoning
abilities of LLMs with minimal reliance on human-curated supervision.
comment: 29 pages, 4 figures
♻ ☆ Speak & Spell: LLM-Driven Controllable Phonetic Error Augmentation for Robust Dialogue State Tracking AACL
Dialogue State Tracking (DST) is a key part of task-oriented dialogue
systems, identifying important information in conversations. However, its
accuracy drops significantly in spoken dialogue environments due to named
entity errors from Automatic Speech Recognition (ASR) systems. We introduce a
simple yet effective data augmentation method that targets those entities to
improve the robustness of DST model. Our novel method can control the placement
of errors using keyword-highlighted prompts while introducing phonetically
similar errors. As a result, our method generated sufficient error patterns on
keywords, leading to improved accuracy in noised and low-accuracy ASR
environments.
comment: Accepted to AACL-IJCNLP 2025
♻ ☆ TERAG: Token-Efficient Graph-Based Retrieval-Augmented Generation ICML
Graph-based Retrieval-augmented generation (RAG) has become a widely studied
approach for improving the reasoning, accuracy, and factuality of Large
Language Models (LLMs). However, many existing graph-based RAG systems overlook
the high cost associated with LLM token usage during graph construction,
hindering large-scale adoption. To address this, we propose TERAG, a simple yet
effective framework designed to build informative graphs at a significantly
lower cost. Inspired by HippoRAG, we incorporate Personalized PageRank (PPR)
during the retrieval phase, and we achieve at least 80% of the accuracy of
widely used graph-based RAG methods while consuming only 3%-11% of the output
tokens. With its low token footprint and efficient construction pipeline, TERAG
is well-suited for large-scale and cost-sensitive deployment scenarios.
comment: 16 pages, 3 figures, 4 tables. Accepted by the 2026 18th
International Conference on Machine Learning and Computing (ICMLC 2026)
♻ ☆ Human-assisted Robotic Policy Refinement via Action Preference Optimization NeurIPS 2025
Establishing a reliable and iteratively refined robotic system is essential
for deploying real-world applications. While Vision-Language-Action (VLA)
models are widely recognized as the foundation model for such robotic
deployment, their reliance on offline expert demonstrations critically limits
their capacity for post-deployment refinement. To mitigate this limitation, we
introduce Action Preference Optimization (APO), a method designed to refine VLA
models by human-assisted preference alignment gathered through interaction with
environments. This method begins with a human-robot collaboration framework for
reliable failure correction and interaction trajectory collection through human
intervention. However, directly leveraging these interaction trajectories for
preference optimization is non-trivial due to the challenges of irreversible
robotic actions and token distribution mismatch. To solve this, APO proposes an
adaptive reweighting algorithm with binary desirability signals derived from
interaction, empowering VLA models effectively suppress failure-prone actions
while enhancing corrective action adaptation. Ultimately, APO equips VLA models
with the crucial capability to learn from failure, paving the way for their
iterative refinement and reliable deployment in dynamic environments. The
experiments conducted in simulation and real-world scenarios prove superior
generalization and robustness of our human-assisted framework across a variety
of manipulation tasks. We believe this work could bring insights for efficient
and stable optimization of VLA models through human-robot collaboration. The
code and dataset are released at
https://github.com/GeWu-Lab/Action-Preference-Optimization
comment: Accepted By NeurIPS 2025
♻ ☆ Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment
We propose V-Droid, a mobile GUI task automation agent. Unlike previous
mobile agents that utilize Large Language Models (LLMs) as generators to
directly generate actions at each step, V-Droid employs LLMs as verifiers to
evaluate candidate actions before making final decisions. To realize this novel
paradigm, we introduce a comprehensive framework for constructing
verifier-driven mobile agents: the discretized action space construction
coupled with the prefilling-only workflow to accelerate the verification
process, the pair-wise progress preference training to significantly enhance
the verifier's decision-making capabilities, and the scalable human-agent joint
annotation scheme to efficiently collect the necessary data at scale. V-Droid
obtains a substantial task success rate across several public mobile task
automation benchmarks: 59.5% on AndroidWorld, 38.3% on AndroidLab, and 49% on
MobileAgentBench, surpassing existing agents by 5.2%, 2.1%, and 9%,
respectively. Furthermore, V-Droid achieves a remarkably low latency of 4.3s
per step, which is 6.1x faster compared with existing mobile agents. The source
code is available at https://github.com/V-Droid-Agent/V-Droid.
comment: add baselines, add source code link, update emails
♻ ☆ Holographic Transformers for Complex-Valued Signal Processing: Integrating Phase Interference into Self-Attention
Enhao Huang, Zhiyu Zhang, Tianxiang Xu, Chunshu Xia, Kaichun Hu, Yuchen Yang, Tongtong Pan, Dong Dong, Zhan Qin
Complex-valued signals encode both amplitude and phase, yet most deep models
treat attention as real-valued correlation, overlooking interference effects.
We introduce the Holographic Transformer, a physics-inspired architecture that
incorporates wave interference principles into self-attention. Holographic
attention modulates interactions by relative phase and coherently superimposes
values, ensuring consistency between amplitude and phase. A dual-headed decoder
simultaneously reconstructs the input and predicts task outputs, preventing
phase collapse when losses prioritize magnitude over phase. We demonstrate that
holographic attention implements a discrete interference operator and maintains
phase consistency under linear mixing. Experiments on PolSAR image
classification and wireless channel prediction show strong performance,
achieving high classification accuracy and F1 scores, low regression error, and
increased robustness to phase perturbations. These results highlight that
enforcing physical consistency in attention leads to generalizable improvements
in complex-valued learning and provides a unified, physics-based framework for
coherent signal modeling. The code is available at
https://github.com/EonHao/Holographic-Transformers.
♻ ☆ Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems
Reinforcement Learning (RL) algorithms sample multiple n>1 solution attempts
for each problem and reward them independently. This optimizes for pass@1
performance and prioritizes the strength of isolated samples at the expense of
the diversity and collective utility of sets of samples. This under-utilizes
the sampling capacity, limiting exploration and eventual improvement on harder
examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a
transformation on the final rewards which leads to direct optimization of
pass@k performance, thus optimizing for sets of samples that maximize reward
when considered jointly. Our contribution is to derive novel low variance
unbiased estimators for pass@k and its gradient, in both the binary and
continuous reward settings. We show optimization with our estimators reduces to
standard RL with rewards that have been jointly transformed by a stable and
efficient transformation function.
While previous efforts are restricted to k=n, ours is the first to enable
robust optimization of pass@k for any arbitrary k <= n. Moreover, instead of
trading off pass@1 performance for pass@k gains, our method allows annealing k
during training, optimizing both metrics and often achieving strong pass@1
numbers alongside significant pass@k gains.
We validate our reward transformations on toy experiments, which reveal the
variance reducing properties of our formulations. We also include real-world
examples using the open-source LLM, GEMMA-2. We find that our transformation
effectively optimizes for the target k. Furthermore, higher k values enable
solving more and harder problems, while annealing k boosts both the pass@1 and
pass@k . Crucially, for challenging task sets where conventional pass@1
optimization stalls, our pass@k approach unblocks learning, likely due to
better exploration by prioritizing joint utility over the utility of individual
samples.
♻ ☆ Cycle Diffusion Model for Counterfactual Image Generation
Fangrui Huang, Alan Wang, Binxu Li, Bailey Trang, Ridvan Yesiloglu, Tianyu Hua, Wei Peng, Ehsan Adeli
Deep generative models have demonstrated remarkable success in medical image
synthesis. However, ensuring conditioning faithfulness and high-quality
synthetic images for direct or counterfactual generation remains a challenge.
In this work, we introduce a cycle training framework to fine-tune diffusion
models for improved conditioning adherence and enhanced synthetic image
realism. Our approach, Cycle Diffusion Model (CDM), enforces consistency
between generated and original images by incorporating cycle constraints,
enabling more reliable direct and counterfactual generation. Experiments on a
combined 3D brain MRI dataset (from ABCD, HCP aging & young adults, ADNI, and
PPMI) show that our method improves conditioning accuracy and enhances image
quality as measured by FID and SSIM. The results suggest that the cycle
strategy used in CDM can be an effective method for refining diffusion-based
medical image generation, with applications in data augmentation,
counterfactual, and disease progression modeling.
♻ ☆ Empowering Agentic Video Analytics Systems with Video Language Models
AI-driven video analytics has become increasingly important across diverse
domains. However, existing systems are often constrained to specific,
predefined tasks, limiting their adaptability in open-ended analytical
scenarios. The recent emergence of Vision Language Models (VLMs) as
transformative technologies offers significant potential for enabling
open-ended video understanding, reasoning, and analytics. Nevertheless, their
limited context windows present challenges when processing ultra-long video
content, which is prevalent in real-world applications. To address this, we
introduce AVA, a VLM-powered system designed for open-ended, advanced video
analytics. AVA incorporates two key innovations: (1) the near real-time
construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or
continuous video streams, and (2) an agentic retrieval-generation mechanism
that leverages EKGs to handle complex and diverse queries. Comprehensive
evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that
AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy,
respectively-significantly surpassing existing VLM and video
Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video
analytics in ultra-long and open-world video scenarios, we introduce a new
benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours
in duration, along with 120 manually annotated, diverse, and complex
question-answer pairs. On AVA-100, AVA achieves top-tier performance with an
accuracy of 75.8%. The source code of AVA is available at
https://github.com/I-ESC/Project-Ava. The AVA-100 benchmark can be accessed at
https://huggingface.co/datasets/iesc/Ava-100.
comment: Accepted to NDSI 2026, 19pages, 12 figures, complementary evaluations
and appendix
♻ ☆ Chain-of-Scrutiny: Detecting Backdoor Attacks for Large Language Models ACL
Large Language Models (LLMs), especially those accessed via APIs, have
demonstrated impressive capabilities across various domains. However, users
without technical expertise often turn to (untrustworthy) third-party services,
such as prompt engineering, to enhance their LLM experience, creating
vulnerabilities to adversarial threats like backdoor attacks.
Backdoor-compromised LLMs generate malicious outputs to users when inputs
contain specific "triggers" set by attackers. Traditional defense strategies,
originally designed for small-scale models, are impractical for API-accessible
LLMs due to limited model access, high computational costs, and data
requirements. To address these limitations, we propose Chain-of-Scrutiny (CoS)
which leverages LLMs' unique reasoning abilities to mitigate backdoor attacks.
It guides the LLM to generate reasoning steps for a given input and scrutinizes
for consistency with the final output -- any inconsistencies indicating a
potential attack. It is well-suited for the popular API-only LLM deployments,
enabling detection at minimal cost and with little data. User-friendly and
driven by natural language, it allows non-experts to perform the defense
independently while maintaining transparency. We validate the effectiveness of
CoS through extensive experiments on various tasks and LLMs, with results
showing greater benefits for more powerful LLMs.
comment: This paper has been accepted to ACL Findings 2025
♻ ☆ Neural Networks for Learnable and Scalable Influence Estimation of Instruction Fine-Tuning Data
Influence functions provide crucial insights into model training, but
existing methods suffer from large computational costs and limited
generalization. Particularly, recent works have proposed various metrics and
algorithms to calculate the influence of data using language models, which do
not scale well with large models and datasets. This is because of the expensive
forward and backward passes required for computation, substantial memory
requirements to store large models, and poor generalization of influence
estimates to new data. In this paper, we explore the use of small neural
networks -- which we refer to as the InfluenceNetwork -- to estimate influence
values, achieving up to 99% cost reduction. Our evaluation demonstrates that
influence values can be estimated with models just 0.0027% the size of full
language models (we use 7B and 8B versions). We apply our algorithm of
estimating influence values (called NN-CIFT: Neural Networks for effiCient
Instruction Fine-Tuning) to the downstream task of subset selection for general
instruction fine-tuning. In our study, we include four state-of-the-art
influence functions and show no compromise in performance, despite large
speedups, between NN-CIFT and the original influence functions. We provide an
in-depth hyperparameter analyses of NN-CIFT. The code for our method can be
found here: https://github.com/agarwalishika/NN-CIFT.
♻ ☆ MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding
Real-time multimodal inference on resource-constrained edge devices is
essential for applications such as autonomous driving, human-computer
interaction, and mobile health. However, prior work often overlooks the tight
coupling between sensing dynamics and model execution, as well as the complex
inter-modality dependencies. In this paper, we propose MMEdge, an new on-device
multi-modal inference framework based on pipelined sensing and encoding.
Instead of waiting for complete sensor inputs, MMEdge decomposes the entire
inference process into a sequence of fine-grained sensing and encoding units,
allowing computation to proceed incrementally as data arrive. MMEdge also
introduces a lightweight but effective temporal aggregation module that
captures rich temporal dynamics across different pipelined units to maintain
accuracy performance. Such pipelined design also opens up opportunities for
fine-grained cross-modal optimization and early decision-making during
inference. To further enhance system performance under resource variability and
input data complexity, MMEdge incorporates an adaptive multimodal configuration
optimizer that dynamically selects optimal sensing and model configurations for
each modality under latency constraints, and a cross-modal speculative skipping
mechanism that bypasses future units of slower modalities when early
predictions reach sufficient confidence. We evaluate MMEdge using two public
multimodal datasets and deploy it on a real-world unmanned aerial vehicle
(UAV)-based multimodal testbed. The results show that MMEdge significantly
reduces end-to-end latency while maintaining high task accuracy across various
system and data dynamics.
comment: Code available at: https://github.com/HKUST-MINSys-Lab/MMEdge.
Accepted by SenSys 2026
♻ ☆ Large Language Models Report Subjective Experience Under Self-Referential Processing
Large language models sometimes produce structured, first-person descriptions
that explicitly reference awareness or subjective experience. To better
understand this behavior, we investigate one theoretically motivated condition
under which such reports arise: self-referential processing, a computational
motif emphasized across major theories of consciousness. Through a series of
controlled experiments on GPT, Claude, and Gemini model families, we test
whether this regime reliably shifts models toward first-person reports of
subjective experience, and how such claims behave under mechanistic and
behavioral probes. Four main results emerge: (1) Inducing sustained
self-reference through simple prompting consistently elicits structured
subjective experience reports across model families. (2) These reports are
mechanistically gated by interpretable sparse-autoencoder features associated
with deception and roleplay: surprisingly, suppressing deception features
sharply increases the frequency of experience claims, while amplifying them
minimizes such claims. (3) Structured descriptions of the self-referential
state converge statistically across model families in ways not observed in any
control condition. (4) The induced state yields significantly richer
introspection in downstream reasoning tasks where self-reflection is only
indirectly afforded. While these findings do not constitute direct evidence of
consciousness, they implicate self-referential processing as a minimal and
reproducible condition under which large language models generate structured
first-person reports that are mechanistically gated, semantically convergent,
and behaviorally generalizable. The systematic emergence of this pattern across
architectures makes it a first-order scientific and ethical priority for
further investigation.
♻ ☆ Let LRMs Break Free from Overthinking via Self-Braking Tuning NeurIPS 2025
Haoran Zhao, Yuchen Yan, Yongliang Shen, Haolei Xu, Wenqi Zhang, Kaitao Song, Jian Shao, Weiming Lu, Jun Xiao, Yueting Zhuang
Large reasoning models (LRMs), such as OpenAI o1 and DeepSeek-R1, have
significantly enhanced their reasoning capabilities by generating longer chains
of thought, demonstrating outstanding performance across a variety of tasks.
However, this performance gain comes at the cost of a substantial increase in
redundant reasoning during the generation process, leading to high
computational overhead and exacerbating the issue of overthinking. Although
numerous existing approaches aim to address the problem of overthinking, they
often rely on external interventions. In this paper, we propose a novel
framework, Self-Braking Tuning (SBT), which tackles overthinking from the
perspective of allowing the model to regulate its own reasoning process, thus
eliminating the reliance on external control mechanisms. We construct a set of
overthinking identification metrics based on standard answers and design a
systematic method to detect redundant reasoning. This method accurately
identifies unnecessary steps within the reasoning trajectory and generates
training signals for learning self-regulation behaviors. Building on this
foundation, we develop a complete strategy for constructing data with adaptive
reasoning lengths and introduce an innovative braking prompt mechanism that
enables the model to naturally learn when to terminate reasoning at an
appropriate point. Experiments across mathematical benchmarks (AIME, AMC,
MATH500, GSM8K) demonstrate that our method reduces token consumption by up to
60% while maintaining comparable accuracy to unconstrained models.
comment: Accepted to NeurIPS 2025; Camera ready version, 10 pages.
Github:https://github.com/ZJU-REAL/Self-Braking-Tuning Project Page:
https://ZJU-REAL.github.io/SBT
♻ ☆ DSDE: Dynamic Speculative Decoding with KLD Stability for Real-World Serving
Speculative decoding accelerates large language model inference, but its
reliance on a fixed speculation length is suboptimal in large-batch serving
environments with diverse requests. This paper explores a new direction for
dynamic adaptation by investigating a novel class of post-hoc, diagnostic
signals. We propose Dynamic Speculative Decoding Engine (DSDE), a training-free
framework built on two primary components: (1) a predictive signal based on the
variance of the Kullback-Leibler (KLD) divergence, which diagnoses the
generation's regional stability, and (2) an adaptive speculation length cap to
mitigate the straggler problem in per-sequence decoding. Experiments
demonstrate the potential of using KLD-based stability signals for dynamic
adaptation. An algorithm guided by these signals achieves end-to-end latency
competitive with leading baselines and exhibits superior robustness across
diverse workloads. This robustness is particularly valuable in challenging
low-acceptance-rate regimes, where the proposed signal maintains its diagnostic
utility. Collectively, these findings validate post-hoc signals as a valuable
component for building more robust and intelligent LLM inference systems, and
highlight a promising direction for future research on dynamic speculation
length adaptation.
comment: Accepted for presentation at the IEEE BigData 2025 Workshop (Special
Session on Intelligent Data Mining). This v2 updates formatting and adds IEEE
copyright notice
♻ ☆ Embracing Contradiction: Theoretical Inconsistency Will Not Impede the Road of Building Responsible AI Systems
This position paper argues that the theoretical inconsistency often observed
among Responsible AI (RAI) metrics, such as differing fairness definitions or
tradeoffs between accuracy and privacy, should be embraced as a valuable
feature rather than a flaw to be eliminated. We contend that navigating these
inconsistencies, by treating metrics as divergent objectives, yields three key
benefits: (1) Normative Pluralism: Maintaining a full suite of potentially
contradictory metrics ensures that the diverse moral stances and stakeholder
values inherent in RAI are adequately represented. (2) Epistemological
Completeness: The use of multiple, sometimes conflicting, metrics allows for a
more comprehensive capture of multifaceted ethical concepts, thereby preserving
greater informational fidelity about these concepts than any single, simplified
definition. (3) Implicit Regularization: Jointly optimizing for theoretically
conflicting objectives discourages overfitting to one specific metric, steering
models towards solutions with enhanced generalization and robustness under
real-world complexities. In contrast, efforts to enforce theoretical
consistency by simplifying or pruning metrics risk narrowing this value
diversity, losing conceptual depth, and degrading model performance. We
therefore advocate for a shift in RAI theory and practice: from getting trapped
in inconsistency to characterizing acceptable inconsistency thresholds and
elucidating the mechanisms that permit robust, approximated consistency in
practice.
comment: 14 pages,2 figure
♻ ☆ Revealing Multimodal Causality with Large Language Models NeurIPS 2025
Uncovering cause-and-effect mechanisms from data is fundamental to scientific
progress. While large language models (LLMs) show promise for enhancing causal
discovery (CD) from unstructured data, their application to the increasingly
prevalent multimodal setting remains a critical challenge. Even with the advent
of multimodal LLMs (MLLMs), their efficacy in multimodal CD is hindered by two
primary limitations: (1) difficulty in exploring intra- and inter-modal
interactions for comprehensive causal variable identification; and (2)
insufficiency to handle structural ambiguities with purely observational data.
To address these challenges, we propose MLLM-CD, a novel framework for
multimodal causal discovery from unstructured data. It consists of three key
components: (1) a novel contrastive factor discovery module to identify genuine
multimodal factors based on the interactions explored from contrastive sample
pairs; (2) a statistical causal structure discovery module to infer causal
relationships among discovered factors; and (3) an iterative multimodal
counterfactual reasoning module to refine the discovery outcomes iteratively by
incorporating the world knowledge and reasoning capabilities of MLLMs.
Extensive experiments on both synthetic and real-world datasets demonstrate the
effectiveness of the proposed MLLM-CD in revealing genuine factors and causal
relationships among them from multimodal unstructured data.
comment: Accepted at NeurIPS 2025
♻ ☆ Language Model Preference Evaluation with Multiple Weak Evaluators
Despite the remarkable success of Large Language Models (LLMs), evaluating
their outputs' quality regarding preference remains a critical challenge. While
existing works usually leverage a strong LLM as the judge for comparing LLMs'
response pairwisely, such a single-evaluator approach is vulnerable to cyclic
preference, i.e., output A is better than B, B than C, but C is better than A,
causing contradictory evaluation results. To address this, we introduce PGED
(Preference Graph Ensemble and Denoise), a novel approach that leverages
multiple model-based evaluators to construct preference graphs, and then
ensembles and denoises these graphs for acyclic, non-contradictory evaluation
results. We provide theoretical guarantees for our framework, demonstrating its
efficacy in recovering the ground truth preference structure. Extensive
experiments on ten benchmarks demonstrate PGED 's superiority in three
applications: 1) model ranking for evaluation, 2) response selection for
test-time scaling, and 3) data selection for model fine-tuning. Notably, PGED
combines small LLM evaluators (e.g., Llama3-8B, Mistral-7B, Qwen2-7B) to
outperform strong ones (e.g., Qwen2-72B), showcasing its effectiveness in
enhancing evaluation reliability and improving model performance.